Abstract



We present a quasi-model-independent search for the physics responsible for electroweak symmetry 
breaking. We define final states to be studied, and construct a rule that identifies a set of relevant 
variables for any particular final state. A new algorithm ("Sleuth") searches for regions of excess 
in those variables and quantifies the significance of any detected excess. After demonstrating the 
sensitivity of the method, we apply it to the semi-inclusive channel $e\mu X$ collected in 108 pb$^{-1}$ of $p\bar{p}$
collisions at $\sqrt{s} = 1.8$ TeV at the D0 experiment during 1992-1996 at the Fermilab Tevatron. We
find no evidence of new high $p_T$ physics in this sample.

It is generally recognized that the standard model, an
extremely successful description of the fundamental par- 
ticles and their interactions, must be incomplete. Al- 
though there is likely to be new physics beyond the cur- 
rent picture, the possibilities are sufficiently broad that 
the first hint could appear in any of many different guises. 
This suggests the importance of performing searches that 
are as model-independent as possible. 

The word "model" can connote varying degrees of gen- 
erality. It can mean a particular model together with 
definite choices of parameters [e.g., mSUGRA [1] with
specified $m_{1/2}$, $m_0$, $A_0$, $\tan\beta$, and sign($\mu$)]; it can mean
a particular model with unspecified parameters (e.g., 
mSUGRA); it can mean a more general model (e.g., 
SUGRA); it can mean an even more general model (e.g., 
gravity-mediated supersymmetry); it can mean a class of 
general models (e.g., supersymmetry); or it can be a set 
of classes of general models (e.g., theories of electroweak 
symmetry breaking). As one ascends this hierarchy of 
generality, predictions of the "model" become less pre- 
cise. While there have been many searches for phenom- 
ena predicted by models in the narrow sense, there have 
been relatively few searches for predictions of the more 
general kind. 

In this article we describe an explicit prescription for 
searching for the physics responsible for stabilizing elec- 
troweak symmetry breaking, in a manner that relies only 
upon what we are sure we know about electroweak sym- 
metry breaking: that its natural scale is on the order 
of the Higgs mass [?]. When we wish to emphasize the
generality of the approach, we say that it is quasi-model- 
independent, where the "quasi" refers to the fact that the 
correct model of electroweak symmetry breaking should 
become manifest at the scale of several hundred GeV. 

New sources of physics will in general lead to an excess 
over the expected background in some final state. A gen- 
eral signature for new physics is therefore a region of vari- 
able space in which the probability for the background to 
fluctuate up to or above the number of observed events is 
small. Because the mass scale of electroweak symmetry 
breaking is larger than the mass scale of most standard 
model backgrounds, we expect this excess to populate 
regions of high transverse momentum ($p_T$). The method
we will describe involves a systematic search for such ex- 
cesses (although with a small modification it is equally 
applicable to searches for deficits). Although motivated 
by the problem of electroweak symmetry breaking, this 
method is generally sensitive to any new high $p_T$ physics.

An important benefit of a precise a priori algorithm 
of the type we construct is that it allows an a posteri- 
ori evaluation of the significance of a small excess, in 
addition to providing a recipe for searching for such an 
effect. The potential benefit of this feature can be seen 
by considering the two curious events seen by the CDF 
collaboration in their semi-inclusive $e\mu$ sample [?] and
one event in the data sample we analyze in this article,
which have prompted efforts to determine the probabil- 
ity that the standard model alone could produce such a 
result [?]. This is quite difficult to do a posteriori, as one
is forced to somewhat arbitrarily decide what is meant 
by "such a result." The method we describe provides an 
unbiased and quantitative answer to such questions. 

"Sleuth," a quasi-model-independent prescription for 
searching for high $p_T$ physics beyond the standard model,
has two components: 

• the definitions of physical objects and final states, 
and the variables relevant for each final state; and 

• an algorithm that systematically hunts for an ex- 
cess in the space of those variables, and quantifies 
the likelihood of any excess found. 



We describe the prescription in Secs. II and III. In
Sec. II we define the physical objects and final states,
and we construct a rule for choosing variables relevant
for any final state. In Sec. III we describe an algorithm
that searches for a region of excess in a multidimensional
space, and determines how unlikely it is that this excess
arose simply from a statistical fluctuation, taking account
of the fact that the search encompasses many regions of
this space. This algorithm is especially useful when
applied to a large number of final states. For a first
application of Sleuth, we choose the semi-inclusive $e\mu$ data set
($e\mu X$) because it contains "known" signals (pair production
of W bosons and top quarks) that can be used to
quantify the sensitivity of the algorithm to new physics,
and because this final state is prominent in several models
of physics beyond the standard model [?]. In Sec. IV we
describe the data set and the expected backgrounds from
the standard model and instrumental effects. In Sec. V
we demonstrate the sensitivity of the method by ignoring
the existence of top quark and W boson pair production,
and showing that the method can find these signals in
the data. In Sec. VI we apply the Sleuth algorithm to
the $e\mu X$ data set assuming the known backgrounds,
including $WW$ and $t\bar{t}$, and present the results of a search
for new physics beyond the standard model.



II. SEARCH STRATEGY

Most recent searches for new physics have followed a
well-defined set of steps: first selecting a model to be
tested against the standard model, then finding a
measurable prediction of this model that differs as much as
possible from the prediction of the standard model, and
finally comparing the predictions to data. This is clearly
the procedure to follow for a small number of compelling
candidate theories. Unfortunately, the resources required
to implement this procedure grow almost linearly with
the number of theories. Although broadly speaking there
are currently only three models with internally consistent
methods of electroweak symmetry breaking (supersymmetry [?],
strong dynamics [?], and theories incorporating
large extra dimensions [?]), the number of specific models
(and corresponding experimental signatures) is in the
hundreds. Of these many specific models, at most one is
a correct description of nature.

Another issue is that the results of searches for new
physics can be unintentionally biased because the number
of events under consideration is small, and the details of
the analysis are often not specified before the data are
examined. An a priori technique would permit a detailed
study without fear of biasing the result.

We first specify the prescription in a form that should
be applicable to any collider experiment sensitive to
physics at the electroweak scale. We then provide aspects
of the prescription that are specific to D0. Other
experiments wishing to use this prescription would specify
similar details appropriate to their detectors.



A. General prescription

We begin by defining final states, and follow by
motivating the variables we choose to consider for each
of those final states. We assume that standard
particle identification requirements, often detector-specific,
have been agreed upon. The understanding of all
backgrounds, through Monte Carlo programs and data, is
crucial to this analysis, and requires great attention to
detail. Standard methods for understanding backgrounds
(comparing different Monte Carlos, normalizing
background predictions to observation, obtaining instrumental
backgrounds from related samples, demonstrating
agreement in limited regions of variable space, and
calibrating against known physical quantities, among many
others) are needed and used in this analysis as in any
other. Uncertainties in backgrounds, which can limit the
sensitivity of the search, are naturally folded into this
approach.



1. Final states

In this subsection we partition the data into final
states. The specification is based on the notions of
exclusive channels and standard particle identification.

a. Exclusiveness. Although analyses are frequently 
performed on inclusive samples, considering only exclu- 
sive final states has several advantages in the context of 
this approach: 

• the presence of an extra object (electron, photon, 
muon, . . . ) in an event often qualitatively affects 
the probable interpretation of the event; 

• the presence of an extra object often changes the 
variables that are chosen to characterize the final 
state; and 

• using inclusive final states can lead to ambiguities
when different channels are combined. 

We choose to partition the data into exclusive categories. 

b. Particle identification. We now specify the label- 
ing of these exclusive final states. The general principle 
is that we label the event as completely as possible, as 
long as we have a high degree of confidence in the la- 
bel. This leads naturally to an explicit prescription for 
labeling final states. 

Most multipurpose experiments are able to identify 
electrons, muons, photons, and jets, and so we begin by 
considering a final state to be described by the number of 
isolated electrons, muons, photons, and jets observed in 
the event, and whether there is a significant imbalance in 
transverse momentum ($\not\!E_T$). We treat $\not\!E_T$ as an object in
its own right, which must pass certain quality criteria. If
$b$-tagging, $c$-tagging, or $\tau$-tagging is possible, then we can
differentiate among jets arising from $b$ quarks, $c$ quarks,
light quarks, and hadronic tau decays. If a magnetic field
can be used to obtain the electric charge of a lepton, we
split the charged leptons $\ell$ into $\ell^+$ and $\ell^-$ but consider
final states that are related through global charge
conjugation to be equivalent in $p\bar{p}$ or $e^+e^-$ (but not $pp$)
collisions. Thus $e^+e^-\gamma$ is a different final state than $e^+e^+\gamma$,
but $e^+e^+\gamma$ and $e^-e^-\gamma$ together make up a single final
state. The definitions of these objects are logically
specified for general use in all analyses, and we use these
standard identification criteria to define our objects.

We can further specify a final state by identifying any
W or Z bosons in the event. This has the effect (for
example) of splitting the $eejj$, $\mu\mu jj$, and $\tau\tau jj$ final states
into the $Zjj$, $eejj$, $\mu\mu jj$, and $\tau\tau jj$ channels, and
splitting the $e\not\!E_T jj$, $\mu\not\!E_T jj$, and $\tau\not\!E_T jj$ final states into $Wjj$,
$e\not\!E_T jj$, $\mu\not\!E_T jj$, and $\tau\not\!E_T jj$ channels.

We combine an $\ell^+\ell^-$ pair into a Z if their invariant
mass $M_{\ell^+\ell^-}$ falls within a Z boson mass window
($82 < M_{\ell^+\ell^-} < 100$ GeV for D0 data) and the event
contains neither significant $\not\!E_T$ nor a third charged lepton.
If the event contains exactly one photon in addition
to an $\ell^+\ell^-$ pair, and contains neither significant $\not\!E_T$ nor a
third charged lepton, and if $M_{\ell^+\ell^-}$ does not fall within
the Z boson mass window, but $M_{\ell^+\ell^-\gamma}$ does, then the
$\ell^+\ell^-\gamma$ triplet becomes a Z boson. If the experiment is
not capable of distinguishing between $\ell^+$ and $\ell^-$ and the
event contains exactly two $\ell$'s, they are assumed to have
opposite charge. A lepton and $\not\!E_T$ become a W boson
if the transverse mass $M^T_{\ell\not\!E_T}$ is within a W boson mass
window ($30 < M^T_{\ell\not\!E_T} < 110$ GeV for D0 data) and the
event contains no second charged lepton. Because the W
boson mass window is so much wider than the Z boson
mass window, we make no attempt to identify radiative
W boson decays.
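
As an illustration only, the boson-identification rules above can be sketched in Python. The event summary passed to the function (lepton and photon counts, a flag for significant missing transverse energy, and precomputed masses) and the names `identify_boson` and `in_window` are assumptions made for this sketch, not part of any D0 software; only the two mass windows are taken from the text.

```python
Z_WINDOW = (82.0, 100.0)   # GeV, Z mass window quoted in the text for D0 data
W_WINDOW = (30.0, 110.0)   # GeV, W transverse-mass window quoted in the text

def in_window(value, window):
    """True if value lies strictly inside the (low, high) window."""
    low, high = window
    return low < value < high

def identify_boson(n_leptons, n_photons, has_met, m_ll=None, m_llg=None, m_t=None):
    """Return 'Z', 'W', or None for a hypothetical event summary.

    n_leptons counts charged leptons; has_met flags significant missing E_T;
    m_ll, m_llg, and m_t are the dilepton mass, dilepton+photon mass, and
    lepton + missing E_T transverse mass in GeV (None when undefined).
    """
    if n_leptons == 2 and not has_met:
        # Exactly two leptons (assumed opposite charge), no missing E_T.
        if m_ll is not None and in_window(m_ll, Z_WINDOW):
            return "Z"
        # Radiative Z: the pair alone misses the window, but adding the
        # single photon brings the mass inside it.
        if n_photons == 1 and m_llg is not None and in_window(m_llg, Z_WINDOW):
            return "Z"
    if n_leptons == 1 and has_met:
        # One lepton plus missing E_T with no second charged lepton.
        if m_t is not None and in_window(m_t, W_WINDOW):
            return "W"
    return None
```

For example, `identify_boson(2, 1, False, m_ll=70.0, m_llg=90.0)` labels the lepton pair plus photon as a Z, while the same call with `has_met=True` leaves the event unlabeled.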

We do not identify top quarks, gluons, or W or Z
bosons from hadronic decays because we would have
little confidence in such a label. Since the predicted cross
sections for new physics are comparable to those for the
production of detectable ZZ, WZ, and WW final states,
we also elect not to identify these final states.

c. Choice of final states to study. Because it is not 
realistic to specify backgrounds for all possible exclusive 
final states, choosing prospective final states is an im- 
portant issue. Theories of physics beyond the standard 
model make such wide-ranging predictions that neglect of 
any particular final state purely on theoretical grounds 
would seem unwise. Focusing on final states in which 
the data themselves suggest something interesting can 
be done without fear of bias if all final states and vari- 
ables for those final states are defined prior to examining 
the data. Choosing variables is the subject of the next 
section. 



2. Variables 

We construct a mapping from each final state to a list
of key variables for that final state using a simple,
well-motivated, and short set of rules. The rules, which are
summarized in Table I, are obtained through the
following reasoning:

• There is strong reason to believe that the physics 
responsible for electroweak symmetry breaking oc- 
curs at the scale of the mass of the Higgs boson, 
or on the order of a few hundred GeV. Any new 
massive particles associated with this physics can 
therefore be expected to decay into objects with 
large transverse momenta in the final state. 

• Many models of electroweak symmetry breaking
predict final states with large missing transverse
energy. This arises in a large class of $R$-parity
conserving supersymmetric theories containing a
neutral, stable, lightest supersymmetric particle; in
theories with "large" extra dimensions containing
a Kaluza-Klein tower of gravitons that escape into
the multidimensional "bulk space" [?]; and more
generally from neutrinos produced in electroweak
boson decay. If the final state contains significant
$\not\!E_T$, then $\not\!E_T$ is included in the list of promising
variables. We do not use $\not\!E_T$ that is reconstructed
as a W boson decay product, following the
prescription for W and Z boson identification outlined
above.

• If the final state contains one or more leptons we
use the summed scalar transverse momenta $\sum p_T^\ell$,
where the sum is over all leptons whose identity
can be determined and whose momenta can be
accurately measured. Leptons that are reconstructed
as W or Z boson decay products are not included
in this sum, again following the prescription for
W and Z boson identification outlined above. We
combine the momenta of $e$, $\mu$, and $\tau$ leptons
because these objects are expected to have comparable
transverse momenta on the basis of lepton
universality in the standard model and the negligible
values of lepton masses.

• Similarly, photons and W and Z bosons are most
likely to signal the presence of new phenomena
when they are produced at high transverse momentum.
Since the expected transverse momenta of the
electroweak gauge bosons are comparable, we use
the variable $\sum p_T^{\gamma/W/Z}$, where the scalar sum is over
all electroweak gauge bosons in the event, for final
states with one or more of them identified.

• For events with one jet in the final state, the
transverse energy of that jet is an important variable.
For events with two or more jets in the final state,
previous analyses have made use of the sum of the
transverse energies of all but the leading jet [?].
The reason for excluding the energy of the leading
jet from this sum is that while a hard jet is often
obtained from QCD radiation, hard second and third
radiative jets are relatively much less likely. We
therefore choose the variable $\sum' p_T^j$ to describe the
jets in the final state, where $\sum' p_T^j$ denotes $p_T^{j_1}$ if
the final state contains only one jet, and $\sum_{i=2}^{n} p_T^{j_i}$
if the final state contains two or more jets. Since
QCD dijets are a large background in all-jets final
states, $\sum' p_T^j$ refers instead to $\sum_{i=3}^{n} p_T^{j_i}$ for final
states containing $n$ jets and nothing else, where
$n \geq 3$.

When there are exactly two objects in an event (e.g., 
one Z boson and one jet), their $p_T$ values are expected
to be nearly equal, and we therefore use the average $p_T$
of the two objects. When there is only one object in an 
event (e.g., a single W boson), we use no variables, and 
simply perform a counting experiment. 
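
The rules above amount to a small lookup from a final state to a list of variables, which can be sketched as follows. The function name, the argument layout, and the string labels are hypothetical; the sketch also assumes that missing transverse energy counts as one object when deciding whether an event has one or two objects, and that the counts passed in already exclude leptons and missing transverse energy absorbed into W or Z bosons.

```python
def sleuth_variables(n_leptons, n_bosons, n_jets, has_met):
    """Variables to examine for a final state (illustrative sketch).

    n_leptons: charged leptons not assigned to a W or Z
    n_bosons:  photons plus identified W and Z bosons
    n_jets:    jets
    has_met:   significant missing E_T not assigned to a W
    """
    n_objects = n_leptons + n_bosons + n_jets + int(has_met)
    if n_objects <= 1:
        return []                                  # counting experiment only
    if n_objects == 2:
        return ["average pT of the two objects"]

    variables = []
    if has_met:
        variables.append("missing ET")
    if n_leptons >= 1:
        variables.append("sum pT(leptons)")
    if n_bosons >= 1:
        variables.append("sum pT(gamma/W/Z)")
    if n_jets == 1:
        variables.append("pT(jet 1)")
    elif n_jets >= 2:
        # Drop the leading jet; in all-jets final states with at least
        # three jets, drop the two leading jets.
        only_jets = n_leptons == 0 and n_bosons == 0 and not has_met
        first = 3 if (only_jets and n_jets >= 3) else 2
        variables.append(f"sum pT(jets {first}..{n_jets})")
    return variables
```

A final state with two unassigned leptons, two jets, and missing transverse energy would return the missing-energy, lepton-sum, and jet-sum variables, while a lone W boson returns an empty list.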

Other variables that can help pick out specific signa- 
tures can also be defined. Although variables such as 
invariant mass, angular separation between particular fi- 
nal state objects, and variables that characterize event 
topologies may be useful in testing a particular model, 
these variables tend to be less powerful in a general 
search. Appendix A contains a more detailed discussion
of this point. In the interest of keeping the list of
variables as general, well-motivated, powerful, and short as
possible, we elect to stop with those given in Table I. We
expect evidence for new physics to appear in the high
tails of the $\not\!E_T$, $\sum p_T^\ell$, $\sum p_T^{\gamma/W/Z}$, and $\sum' p_T^j$
distributions.



B. Search strategy: D0 Run I 

The general search strategy just outlined is applicable
to any collider experiment searching for the physics
responsible for electroweak symmetry breaking. Any
particular experiment that wishes to use this strategy needs
to specify object and variable definitions that reflect the
capabilities of the detector. This section serves this
function for the D0 detector [11] in its 1992-1996 run (Run
I) at the Fermilab Tevatron. Details in this subsection
supersede those in the more general section above.

If the final state includes        then consider the variable

$\not\!E_T$                        $\not\!E_T$
one or more charged leptons        $\sum p_T^\ell$
one or more electroweak bosons     $\sum p_T^{\gamma/W/Z}$
one or more jets                   $\sum' p_T^j$

TABLE I. A quasi-model-independently motivated list of
interesting variables for any final state. The set of variables
to consider for any particular final state is the union of the
variables in the second column for each row that pertains to
that final state. Here $\ell$ denotes $e$, $\mu$, or $\tau$. The notation
$\sum' p_T^j$ is shorthand for $p_T^{j_1}$ if the final state contains only
one jet, $\sum_{i=2}^{n} p_T^{j_i}$ if the final state contains $n \geq 2$ jets, and
$\sum_{i=3}^{n} p_T^{j_i}$ if the final state contains $n$ jets and nothing else,
with $n \geq 3$. Leptons and missing transverse energy that are
reconstructed as decay products of W or Z bosons are not
considered separately in the left-hand column.

1. Object definitions 

The particle identification algorithms used here for 
electrons, muons, jets, and photons are similar to those 
used in many published D0 analyses. We summarize 
them here. 

a. Electrons. D0 had no central magnetic field in 
Run I; therefore, there is no way to distinguish be- 
tween electrons and positrons. Electron candidates with 
transverse energy greater than 15 GeV, within the fiducial
region of $|\eta| < 1.1$ or $1.5 < |\eta| < 2.5$ (where
$\eta = -\ln\tan(\theta/2)$, with $\theta$ the polar angle with respect
to the colliding proton's direction), and satisfying standard
electron identification and isolation requirements as
defined in Ref. [12] are accepted.

b. Muons. We do not distinguish between positively 
and negatively charged muons in this analysis. We accept 
muons with transverse momentum greater than 15 GeV 
and $|\eta| < 1.7$ that satisfy standard muon identification
and isolation requirements [12].

c. $\not\!E_T$. The missing transverse energy, $\not\!E_T$, is the
energy required to balance the measured energy in the
event. In the calorimeter, we calculate

    $\not\!E_T^{\rm cal} = \left| \sum_i E_i \sin\theta_i \left( \cos\phi_i\,\hat{x} + \sin\phi_i\,\hat{y} \right) \right|$,    (1)

where $i$ runs over all calorimeter cells, $E_i$ is the energy
deposited in the $i$th cell, and $\phi_i$ is the azimuthal and $\theta_i$
the polar angle of the center of the $i$th cell, measured
with respect to the event vertex.
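
Eq. (1) is a vector sum over cells followed by a magnitude, which a few lines of Python make explicit. The cell representation used here, a list of `(energy, theta, phi)` tuples in GeV and radians, and the function name are assumptions chosen for the sketch.

```python
import math

def missing_et_cal(cells):
    """Calorimeter missing E_T from a list of (energy, theta, phi) cells.

    Sums the transverse energy vectors of all cells and returns the
    magnitude of the total, as in Eq. (1).
    """
    sum_x = sum(e * math.sin(theta) * math.cos(phi) for e, theta, phi in cells)
    sum_y = sum(e * math.sin(theta) * math.sin(phi) for e, theta, phi in cells)
    return math.hypot(sum_x, sum_y)
```

Two equal deposits that are back to back in azimuth cancel and give no missing transverse energy, while a single unbalanced central deposit of 20 GeV gives 20 GeV.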

An event is defined to contain a $\not\!E_T$ "object" only if we
are confident that there is significant missing transverse
energy. Events that do not contain muons are said to
contain $\not\!E_T$ if $\not\!E_T^{\rm cal} > 15$ GeV. Using track deflection in
magnetized steel toroids, the muon momentum resolution
in Run I is

    $\delta(1/p) = 0.18\,(p - 2)/p^2 \oplus 0.003$,    (2)

where $p$ is in units of GeV, and the $\oplus$ means addition in
quadrature. This is significantly coarser than the
electromagnetic and jet energy resolutions, parameterized by

    $\delta E/E = 15\%/\sqrt{E} \oplus 0.3\%$    (3)

and

    $\delta E/E = 80\%/\sqrt{E}$,    (4)

respectively. Events that contain exactly one muon are
deemed to contain $\not\!E_T$ on the basis of muon number
conservation rather than on the basis of the muon momentum
measurement. We do not identify a $\not\!E_T$ object in
events that contain two or more muons.
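
To make the comparison in Eqs. (2)-(4) concrete, the three parameterizations can be evaluated side by side. The function names are hypothetical, and converting the muon resolution on $1/p$ into a fractional momentum resolution uses first-order error propagation, $\delta p/p = p\,\delta(1/p)$, which is an assumption of this sketch rather than a statement from the text.

```python
import math

def quadrature(*terms):
    """Combine independent terms in quadrature (the circled-plus operation)."""
    return math.sqrt(sum(t * t for t in terms))

def muon_inverse_p_resolution(p):
    """delta(1/p) in 1/GeV for a muon of momentum p in GeV, Eq. (2)."""
    return quadrature(0.18 * (p - 2.0) / p**2, 0.003)

def muon_fractional_resolution(p):
    """Approximate delta(p)/p from first-order propagation of Eq. (2)."""
    return p * muon_inverse_p_resolution(p)

def em_fractional_resolution(energy):
    """delta(E)/E for electromagnetic objects, Eq. (3), energy in GeV."""
    return quadrature(0.15 / math.sqrt(energy), 0.003)

def jet_fractional_resolution(energy):
    """delta(E)/E for jets, Eq. (4), energy in GeV."""
    return 0.80 / math.sqrt(energy)
```

At 50 GeV these evaluate to roughly 23% for muons, 11% for jets, and 2% for electromagnetic objects, consistent with the statement that the muon measurement is the coarsest of the three.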

d. Jets. Jets are reconstructed in the calorimeter 
using a fixed-size cone algorithm, with a cone size of 
$\Delta R = \sqrt{(\Delta\phi)^2 + (\Delta\eta)^2} = 0.5$ [13]. We require jets to
have $E_T > 15$ GeV and $|\eta| < 2.5$. We make no attempt
to distinguish among light quarks, gluons, charm quarks, 
bottom quarks, and hadronic tau decays. 

e. Photons. Isolated photons that pass standard 
identification requirements [14], have transverse energy
greater than 15 GeV, and are in the fiducial region
$|\eta| < 1.1$ or $1.5 < |\eta| < 2.5$ are labeled photon objects.

f. W bosons. Following the general prescription
described above, an electron (as defined above) and $\not\!E_T$
become a W boson if their transverse mass is within the W
boson mass window ($30 < M^T_{e\not\!E_T} < 110$ GeV), and the
event contains no second charged lepton. Because the
muon momentum measurement is coarse, we do not use
a transverse mass window for muons. From Sec. II B 1 c, any
event containing a single muon is said to also contain $\not\!E_T$;
thus any event containing a muon and no second charged
lepton is said to contain a W boson.

g. Z bosons. We use the rules in the previous section 
for combining an $ee$ pair or $ee\gamma$ triplet into a Z boson.
We do not attempt to reconstruct a Z boson in events
containing three or more charged leptons. For events
containing two muons and no third charged lepton, we
fit the event to the hypothesis that the two muons are
decay products of a Z boson and that there is no $\not\!E_T$
in the event. If the fit is acceptable, the two muons are 
considered to be a Z boson. 



2. Variables 

The variables provided in the general prescription 
above also need minor revision to be appropriate for the 
D0 experiment. 



a. $\sum p_T^\ell$. We do not attempt to identify $\tau$ leptons,
and the momentum resolution for muons is coarse. For
events that contain no leptons other than muons, we
define $\sum p_T^\ell = \sum p_T^\mu$. For events that contain one or more
electrons, we define $\sum p_T^\ell = \sum p_T^e$. This is identical to
the general definition provided above except for events
containing both one or more electrons and one or more
muons. In this case, we have decided to define $\sum p_T^\ell$ as
the sum of the momenta of the electrons only, rather than
combining the well-measured electron momenta with the
poorly-measured muon momenta.

b. $\not\!E_T$. $\not\!E_T$ is defined by $\not\!E_T = \not\!E_T^{\rm cal}$, where $\not\!E_T^{\rm cal}$ is the
missing transverse energy as summed in the calorimeter.
This sum includes the $p_T$ of electrons, but only a
negligible fraction of the $p_T$ of muons.



c. $\sum p_T^{\gamma/W/Z}$. We use the definition of $\sum p_T^{\gamma/W/Z}$
provided in the general prescription: the sum is over all
electroweak gauge bosons in the event, for final states with
one or more of them. We note that if a W boson is formed
from a $\mu$ and $\not\!E_T$, then $p_T^W = \not\!E_T$.



III. SLEUTH ALGORITHM 

Given a data sample, its final state, and a set of vari- 
ables appropriate to that final state, we now describe the 
algorithm that determines the most interesting region in 
those variables and quantifies the degree of interest. 



A. Overview 

Central to the algorithm is the notion of a "region"
($R$). A region can be regarded simply as a volume in
the variable space defined by Table I, satisfying certain
special properties to be discussed in Sec. III B. The
region contains $N$ data points and an expected number of
background events $b_R$. We can consequently compute the
weighted probability $p_N^R$, defined in Sec. III C 1, that the
background in the region fluctuates up to or beyond the
observed number of events. If this probability is small,
we flag the region as potentially interesting.

In any reasonably-sized data set, there will always be 
regions in which the probability for $b_R$ to fluctuate up to
or above the observed number of events is small. The
relevant issue is how often this can happen in an ensemble of
hypothetical similar experiments (hse's). This question
can be answered by performing these hypothetical simi- 
lar experiments; i.e., by generating random events drawn 
from the background distribution, finding the least prob- 
able region, and repeating this many times. The fraction 
of hypothetical similar experiments that yields a proba- 
bility as low as the one observed in the data provides the 
appropriate measure of the degree of interest. 
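
The ensemble procedure just described can be illustrated with a deliberately simplified one-dimensional toy. Everything specific in it is an assumption made for brevity: the background is taken to be uniform on [0, 1] with a known total expectation, the candidate regions are upper tails $[x_k, 1]$ containing the $k$ highest points (not the regions Sleuth actually uses), background uncertainties are ignored, and all function names are hypothetical.

```python
import math
import random

def poisson_tail(mean, n_obs):
    """P(X >= n_obs) for X ~ Poisson(mean)."""
    term = math.exp(-mean)           # P(X = 0)
    below = 0.0
    for j in range(n_obs):
        below += term                # accumulate P(X = j)
        term *= mean / (j + 1)
    return max(0.0, 1.0 - below)

def sample_poisson(mean, rng):
    """Draw a Poisson variate by multiplying uniforms (Knuth's method)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def least_probable_tail(points, b_total):
    """Smallest tail probability over toy regions [x_k, 1]."""
    best = 1.0
    for k, x in enumerate(sorted(points, reverse=True), start=1):
        best = min(best, poisson_tail(b_total * (1.0 - x), k))
    return best

def fraction_of_hse(p_observed, b_total, n_trials=1000, seed=1):
    """Fraction of background-only pseudo-experiments whose least probable
    region is at least as improbable as the one observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        n_events = sample_poisson(b_total, rng)
        pseudo_data = [rng.random() for _ in range(n_events)]
        if least_probable_tail(pseudo_data, b_total) <= p_observed:
            hits += 1
    return hits / n_trials
```

A small returned fraction means that background-only pseudo-experiments rarely produce a region as improbable as the observed one; a fraction near one means the observed excess is unremarkable once the search over many regions is accounted for.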

Although the details of the algorithm are complex, the 
interface is straightforward. What is needed is a data 
sample, a set of events for each background process i, 
and the number of background events $b_i \pm \delta b_i$ from each
background process expected in the data sample. The 
output gives the region of greatest excess and the fraction 
of hypothetical similar experiments that would yield such 
an excess. 

The algorithm consists of seven steps:

1. Define regions $R$ about any chosen set of $N =
1, \ldots, N_{\rm data}$ data points in the sample of $N_{\rm data}$ data
points.

2. Estimate the background $b_R$ expected within these
$R$.

3. Calculate the weighted probabilities $p_N^R$ that $b_R$ can
fluctuate to $\geq N$.

4. For each $N$, determine the $R$ for which $p_N^R$ is
minimum. Define $p_N = \min_R (p_N^R)$.

5. Determine the fraction $P_N$ of hypothetical similar
experiments in which the $p_N$(hse) is smaller than
the observed $p_N$(data).

6. Determine the $N$ for which $P_N$ is minimized.
Define $P = \min_N (P_N)$.

7. Determine the fraction $\mathcal{P}$ of hypothetical similar
experiments in which the $P$(hse) is smaller than
the observed $P$(data).

Our notation is such that a lowercase $p$ represents a
probability, while an uppercase $P$ or $\mathcal{P}$ represents the
fraction of hypothetical similar experiments that would yield
a less probable outcome. The symbol representing the
minimization of $p_N^R$ over $R$, $p_N$ over $N$, or $P_N$ over $N$ is
written without the superscript or subscript representing
the varied property (i.e., $p_N$, $p$, or $P$, respectively). The
rest of this section discusses these steps in greater detail.



B. Steps 1 and 2: Regions 

When there are events that do not appear to follow 
some expected distribution, such as the event at x = 61 
in Fig. 1, we often attempt to estimate the probability
that the event is consistent with coming from that distri- 
bution. This is generally done by choosing some region 
around the event (or an accumulation of events), inte- 
grating the background within that region, and comput- 
ing the probability that the expected number of events 
in that region could have fluctuated up to or beyond the 
observed number. 
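
For a region with expected background $b$ in which $N$ events are observed, and neglecting any uncertainty on $b$ (the weighted probability that the algorithm uses is defined in Sec. III C 1 and is not reproduced here), the probability described above is the Poisson tail sum:

```latex
p \;=\; \sum_{n=N}^{\infty} \frac{e^{-b}\, b^{n}}{n!}
  \;=\; 1 - \sum_{n=0}^{N-1} \frac{e^{-b}\, b^{n}}{n!}
```

As a purely illustrative case with assumed numbers, a region with $b = 0.1$ containing $N = 1$ event gives $p = 1 - e^{-0.1} \approx 0.095$.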

Of course, the calculated probability depends on how 
the region containing the events is chosen. If the region 
about the event is infinitesimal, then the expected num- 
ber of background events in the region (and therefore 
this probability) can be made arbitrarily small. A
possible approach in one dimension is to define the region
to be the interval bounded below by the point halfway
between the interesting event and its nearest neighbor,
and bounded above by infinity. For the case shown in
Fig. 1, this region would be roughly the interval $(46, \infty)$.

FIG. 1. Example of a data set with a potentially anomalous
point. The solid histogram is the expected distribution, and 
the points with error bars are the data. The bulk of the data 
is well described by the background prediction, but the point 
located at x = 61 appears out of place. 



Such a prescription breaks down in two or more di- 
mensions, and it is not entirely satisfactory even in one 
dimension. In particular, it is not clear how to proceed 
if the excess occurs somewhere other than at the tail end 
of a distribution, or how to generalize the interval to a 
well-defined contour in several dimensions. As we will 
see, there are significant advantages to having a precise 
definition of a region about a potentially interesting set 
of data points. This is provided in Sec. III B 2, after we
specify the variable space itself. 



1. Variable transformation 

Unfortunately, the region that we choose about the
point on the tail of Fig. 1 changes if the variable is some
function of $x$, rather than $x$ itself. If the region about
each data point is to be the subspace that is closer to
that point than to any other one in the sample, it would
therefore be wise to minimize any dependence of the
selection on the shape of the background distribution. For
a background distributed uniformly between 0 and 1 (or,
in $d$ dimensions, uniform within the unit "box" $[0,1]^d$),
it is reasonable to define the region associated with an
event as the variable subspace closer to that event than
to any other event in the sample. If the background is
not already uniform within the unit box, we transform
the variables so that it becomes uniform. The details of
this transformation are provided in Appendix B.

With the background distribution trivialized, the rest 
of the analysis can be performed within the unit box 
without worrying about the background shape. A con- 
siderable simplification is therefore achieved through this 
transformation. The task of determining the expected 
background within each region, which would have re- 
quired a Monte Carlo integration of the background dis- 
tribution over the region, reduces to the problem of de- 
termining the volume of each region. The problem is 
now completely specified by the transformed coordinates 
of the data points, the total number of expected
background events $b$, and its uncertainty $\delta b$.
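
The effect of such a transformation can be illustrated in one dimension with the empirical cumulative distribution of a background sample. This is a minimal sketch under stated assumptions: it is not the procedure of Appendix B, which is not reproduced here and may differ, especially in more than one dimension, and the names `make_uniformizer` and `expected_background` are hypothetical.

```python
import bisect

def make_uniformizer(background_values):
    """Build a map from x to [0, 1] under which the background is uniform.

    Uses the empirical cumulative distribution of a background sample,
    i.e. the fraction of background values at or below x.
    """
    ordered = sorted(background_values)
    n_background = len(ordered)

    def to_unit_interval(x):
        return bisect.bisect_right(ordered, x) / n_background

    return to_unit_interval

def expected_background(b_total, u_low, u_high):
    """Expected background in the transformed interval [u_low, u_high].

    With a uniform background the expectation is the total background
    times the length (volume) of the interval.
    """
    return b_total * (u_high - u_low)
```

With a background sample `[1, 2, 3, 4]`, the value 2.5 maps to 0.5, and an interval covering the upper half of the unit interval is expected to hold half of the total background.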

2. Voronoi diagrams 




FIG. 2. A Voronoi diagram. (a) The seven data points are
shown as black dots; the lines partition the space into seven
regions, with one region belonging to each data point. (b) An
example of a 2-region.



Having defined the variable space by requiring a uni- 
form background distribution, we can now define more 
precisely what is meant by a region. Figure || shows 
a 2-dimensional variable space V containing seven data 
points in a unit square. For any v £ V, we say that v 
belongs to the data point Di if | v — Di \ <\ v — Dj | for 
all j i; that is, v belongs to Di if v is closer to Di 
than to any other data point. In Fig. ||(a), for example, 
any v lying within the variable subspace defined by the 
pentagon in the upper right-hand corner belongs to the 
data point located at (0.9,0.8). The set of points in V 
that do not belong to any data point [those points on the 
lines in Fig. ||(a)] has zero measure and may be ignored. 

We define a region around a set of data points in a 
variable space V to be the set of all points in V that are 
closer to one of the data points in that set than to any 
data points outside that set. A region around a single 
data point is the union of all points in V that belong to 
that data point, and is called a 1-region. A region about 
a set of N data points is the union of all points in V that 
belong to any one of the data points, and is called an N- 
region; an example of a 2-region is shown as the shaded 
area in Fig. 2(b). $N_{\rm data}$ data points thus partition V into
$N_{\rm data}$ 1-regions. Two data points are said to be neighbors
if their 1-regions share a border; the points at (0.75, 0.9)
and (0.9, 0.8) in Fig. 2, for example, are neighbors. A
diagram such as Fig. 2(a), showing a set of data points
and their regions, is known as a Voronoi diagram. We
use a program called HULL pM for this computation. 
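The region definition above can be illustrated with a short sketch. This is not the HULL computation used in the analysis: it is a brute-force stand-in, with hypothetical function names, that assigns uniformly sampled points of the unit box to their nearest data point and estimates the volume of an N-region by counting. Because the background is uniform in the unit box, the expected background in a region is simply b times this volume.

```python
import random


def nearest_index(v, data):
    """Index of the data point closest to v (squared Euclidean distance)."""
    return min(range(len(data)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, data[i])))


def region_volume(data, members, n_samples=20000, seed=0):
    """Monte Carlo estimate of the volume of the region around `members`.

    `data` is a list of points in the unit box and `members` is a set of
    indices into `data`. A uniformly sampled point counts toward the region
    when its nearest data point is one of the members.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    hits = 0
    for _ in range(n_samples):
        v = [rng.random() for _ in range(dim)]
        if nearest_index(v, data) in members:
            hits += 1
    return hits / n_samples


# Two data points placed symmetrically about x = 0.5:
# each 1-region should cover about half of the unit square.
points = [(0.25, 0.5), (0.75, 0.5)]
vol_left = region_volume(points, {0})
vol_all = region_volume(points, {0, 1})
```

An exact Voronoi construction is preferable in practice; the sampling estimate is shown only because it follows the definition of "belongs to" word for word.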



3. Region criteria 

The explicit definition of a region that we have just
provided reduces the number of contours we can draw in
the variable space from infinite to a mere $2^{N_{\rm data}} - 1$, since
any region either contains all of the points belonging to
the $i$th data event or it contains none of them. In fact,
because many of these regions have a shape that makes
them implausible as "discovery regions" in which new
physics might be concentrated, the number of possible
regions may be reduced further. For example, the region
in Fig. 2 containing only the lower-leftmost and the
upper-rightmost data points is unlikely to be a discovery
region, whereas the region shown in Fig. 2(b) containing
the two upper-rightmost data points is more likely
(depending upon the nature of the variables).

We can now impose whatever criteria we wish upon
the regions that we allow Sleuth to consider. In general
we will want to impose several criteria, and in this case
we write the net criterion $c_R = c_R^1 c_R^2 \cdots$ as a product
of the individual criteria, where $c_R^i$ is to be read "the
extent to which the region R satisfies the criterion $c^i$."
The quantities $c_R^i$ take on values in the interval [0,1],
where $c_R^i \to 0$ if R badly fails $c^i$, and $c_R^i \to 1$ if R easily
satisfies $c^i$.

Consider as an example c = AntiCornerSphere, a simple
criterion that we have elected to impose on the regions
in the $e\mu X$ sample. Loosely speaking, a region R will satisfy
this criterion ($c_R \to 1$) if all of the data points inside
the region are farther from the origin than all of the data
points outside the region. This situation is shown, for
example, in Fig. 2(b). For every event $i$ in the data set,
denote by $r_i$ the distance of the point in the unit box to
the origin, let $r'$ be $r$ transformed so that the background
is uniform in $r'$ over the interval [0, 1], and let $r'_i$ be the
values $r_i$ so transformed. Then define



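$$
c_R = \begin{cases}
0, & \dfrac{1}{2} + \dfrac{r'^{\,\rm in}_{\min} - r'^{\,\rm out}_{\max}}{\xi} < 0 \\[2ex]
1, & \dfrac{1}{2} + \dfrac{r'^{\,\rm in}_{\min} - r'^{\,\rm out}_{\max}}{\xi} > 1 \\[2ex]
\dfrac{1}{2} + \dfrac{r'^{\,\rm in}_{\min} - r'^{\,\rm out}_{\max}}{\xi}, & \text{otherwise},
\end{cases}
\tag{5}
$$

where $r'^{\,\rm in}_{\min} = \min_{i \in R}(r'_i)$, $r'^{\,\rm out}_{\max} = \max_{i \notin R}(r'_i)$, and
$\xi = 1/(4 N_{\rm data})$ is an average separation distance between
data points in the variable $r'$.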

Notice that in the limit of vanishing $\xi$, the criterion c
becomes a boolean operator, returning "true" when all
of the data points inside the region are farther from the
origin than all of the data points outside the region, and
"false" otherwise. In fact, many possible criteria have a
scale $\xi$ and reduce to boolean operators when $\xi$ vanishes.
This scale has been introduced to ensure continuity of
the final result under small changes in the background
estimate. In this spirit, the "extent to which R satisfies
the criterion c" has an alternative interpretation as the
"fraction of the time R satisfies the criterion c," where the
average is taken over an ensemble of slightly perturbed
background estimates and $\xi$ is taken to vanish, so that
"satisfies" makes sense. We will use $c_R$ in the next section
to define an initial measure of the degree to which R is
interesting.
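A minimal sketch of the AntiCornerSphere criterion follows. It assumes that Eq. 5 is the linear ramp $1/2 + (r'^{\,\rm in}_{\min} - r'^{\,\rm out}_{\max})/\xi$ clipped to [0, 1], with $\xi = 1/(4 N_{\rm data})$; the function name and the handling of a region with no points outside it are illustrative choices, not part of the text.

```python
def anti_corner_sphere(r_in, r_out, n_data):
    """Extent c_R in [0, 1] to which a region satisfies AntiCornerSphere.

    r_in / r_out hold the transformed radii r' of the data points inside /
    outside the region. The clipped linear form is an assumed reading of
    Eq. 5, with xi = 1 / (4 * n_data).
    """
    if not r_in:
        raise ValueError("a region must contain at least one data point")
    if not r_out:
        # Assumption: with no data points outside, nothing violates the criterion.
        return 1.0
    xi = 1.0 / (4 * n_data)
    value = 0.5 + (min(r_in) - max(r_out)) / xi
    return max(0.0, min(1.0, value))
```

As $\xi$ shrinks the ramp steepens, and the function approaches the boolean test "every point inside is farther from the origin than every point outside" described above.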

We have considered several other criteria that could 
be imposed upon any potential discovery region to en- 
sure that the region is "reasonably shaped" and "in a 
believable location." We discuss a few of these criteria in 
Appendix C.



C. Step 3: Probabilities and uncertainties 

Now that we have specified the notion of a region, we 
can define a quantitative measure of the "degree of inter- 
est" of a region. 



1. Probabilities 

Since we are looking for regions of excess, the appropriate
measure of the degree of interest is a slight modification
of the probability of background fluctuating up to
or above the observed number of events. For an $N$-region
$R$ in which $b_R$ background events are expected and $b_R$ is
precisely known, this probability is



$$ \sum_{i=N}^{\infty} \frac{e^{-b_R} (b_R)^i}{i!}. \tag{6} $$



We use this to define the weighted probability

$$ p_N^R = c_R \sum_{i=N}^{\infty} \frac{e^{-b_R} (b_R)^i}{i!} + (1 - c_R), \tag{7} $$



which one can also think of as an "average probability," 
where the average is taken over the ensemble of slightly 
perturbed background estimates referred to above. By 
construction, this quantity has all of the properties we 
need: it reduces to the probability in Eq. 6 in the limit
that R easily satisfies the region criteria, it saturates at 
unity in the limit that R badly fails the region criteria, 
and it exhibits continuous behavior under small pertur- 
bations in the background estimate between these two 
extremes. 
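The weighted probability can be sketched in a few lines, assuming it takes the form $c_R$ times the Poisson tail of Eq. 6 plus $(1 - c_R)$. The function names are hypothetical.

```python
import math


def poisson_tail(n, b):
    """P(X >= n) for X ~ Poisson(b), evaluated as 1 - P(X <= n - 1)."""
    term = math.exp(-b)  # P(X = 0)
    cdf = 0.0
    for i in range(n):
        cdf += term
        term *= b / (i + 1)
    return max(0.0, 1.0 - cdf)


def weighted_probability(n, b_region, c_region):
    """Weighted probability: c_R * P(X >= N) + (1 - c_R)."""
    return c_region * poisson_tail(n, b_region) + (1.0 - c_region)
```

The two limits quoted in the text can be read off directly: with `c_region = 1` the function returns the plain Poisson tail, and with `c_region = 0` it returns unity whatever the observed excess.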



2. Systematic uncertainties 

The expected number of events from each background 
process has a systematic uncertainty that must be taken 
into account. There may also be an uncertainty in the 
shape of a particular background distribution — for ex- 
ample, the tail of a distribution may have a larger sys- 
tematic uncertainty than the mode. 

The background distribution comprises one or more 
contributing background processes. For each background 
process we know the number of expected events and the 
systematic uncertainty on this number, and we have a set 
of Monte Carlo points that tell us what that background 
process looks like in the variables of interest. A typical 
situation is sketched in Fig. 3.

FIG. 3. An example of a one-dimensional background distribution
with three sources. The normalized shapes of the individual
background processes are shown as the dashed lines;
the solid line is their sum. Typically, the normalizations for
the background processes have separate systematic errors.
These errors can change the shape of the total background
curve in addition to its overall normalization. For example, if
the long-dashed curve has a large systematic error, then the
solid curve will be known less precisely in the region (3, 5)
than in the region (0, 3) where the other two backgrounds
dominate.



The multivariate transformation described in Sec. III B 1 is obtained assuming that the
number of events expected from each background process
is known precisely. This fixes each event's position
in the unit box, its neighbors, and the volume of the surrounding
region. The systematic uncertainty $\delta b_R$ on the
number of background events in a given region is computed
by combining the systematic uncertainties for each
individual background process. Eq. 7 then generalizes to



$$ p_N^R = c_R \sum_{i=N}^{\infty} \int_0^{\infty} \frac{1}{\sqrt{2\pi}\,(\delta b_R)} \exp\left(-\frac{(b - b_R)^2}{2(\delta b_R)^2}\right) \frac{e^{-b}\, b^i}{i!}\, db + (1 - c_R), \tag{8} $$

which is seen to reduce to Eq. 7 in the limit $\delta b_R \to 0$.

This formulation provides a way to take account of sys- 
tematic uncertainties on the shapes of distributions, as 
well. For example, if there is a larger systematic uncer- 
tainty on the tail of a distribution, then the background 
process can be broken into two components, one describ- 
ing the bulk of the distribution and one describing the 
tail, and a larger systematic uncertainty assigned to the 
piece that describes the tail. Correlations among the var- 
ious components may also be assigned. 
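The effect of the systematic uncertainty on the Poisson tail can be checked numerically. The sketch below assumes Eq. 8 averages the Poisson tail over a Gaussian in $b$ of mean $b_R$ and width $\delta b_R$, restricted to $b \ge 0$; the trapezoid rule, the integration range of eight standard deviations, and the function names are illustrative choices.

```python
import math


def poisson_tail(n, b):
    """P(X >= n) for X ~ Poisson(b)."""
    term = math.exp(-b)
    cdf = 0.0
    for i in range(n):
        cdf += term
        term *= b / (i + 1)
    return max(0.0, 1.0 - cdf)


def smeared_tail(n, b_region, db_region, steps=2000):
    """Poisson tail averaged over a Gaussian in b, integrated over b >= 0."""
    if db_region <= 0.0:
        return poisson_tail(n, b_region)
    lo = max(0.0, b_region - 8.0 * db_region)
    hi = b_region + 8.0 * db_region
    h = (hi - lo) / steps
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * db_region)
    total = 0.0
    for k in range(steps + 1):
        b = lo + k * h
        gauss = norm * math.exp(-((b - b_region) ** 2) / (2.0 * db_region ** 2))
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end points
        total += weight * gauss * poisson_tail(n, b)
    return total * h


def weighted_probability_syst(n, b_region, db_region, c_region):
    """Weighted probability including the background uncertainty."""
    return c_region * smeared_tail(n, b_region, db_region) + (1.0 - c_region)
```

For a small uncertainty the result approaches the unsmeared tail, and for an observed count well above the expectation a larger uncertainty raises the probability, which makes the excess less significant.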

We vary the number of events generated in the hypothetical
similar experiments according to the systematic
and statistical uncertainties. The systematic errors are
accounted for by pulling a vector of the "true" number
of expected background events $\vec{b}$ from the distribution

$$ P(\vec{b}) \propto \exp\left(-\tfrac{1}{2}\,(b_i - b_i^0)\,\Sigma^{-1}_{ij}\,(b_j - b_j^0)\right), \tag{9} $$

where $b_i^0$ is the number of expected background events
from process $i$, as before, and $b_i$ is the $i$th component of
$\vec{b}$. We have introduced a covariance matrix $\Sigma$, which is
diagonal with components $\Sigma_{ii} = (\delta b_i)^2$ in the limit that
the systematic uncertainties on the different background
processes are uncorrelated, and we assume summation on
repeated indices in Eq. 9. The statistical uncertainties in
turn are allowed for by choosing the number of events $N_i$
from each background process $i$ from the Poisson distribution

$$ P(N_i) = \frac{e^{-b_i}\, b_i^{N_i}}{N_i!}, \tag{10} $$

where $b_i$ is the $i$th component of the vector $\vec{b}$ just determined.
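The two-stage draw just described can be sketched as follows for the uncorrelated case (diagonal covariance). The clipping of negative Gaussian draws at zero and the function names are assumptions made for this illustration; the example inputs are the non-top backgrounds of the first row of Table III.

```python
import math
import random


def draw_poisson(mean, rng):
    """Poisson draw by Knuth's multiplication method (fine for modest means)."""
    if mean <= 0.0:
        return 0
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_event_counts(expected, uncertainty, rng):
    """One hypothetical similar experiment: event counts per background process.

    Stage 1 pulls the "true" expectation from a Gaussian (systematic error);
    stage 2 pulls the observed count from a Poisson (statistical error).
    """
    counts = []
    for b0, db in zip(expected, uncertainty):
        b_true = max(0.0, rng.gauss(b0, db))  # assumption: clip negative draws
        counts.append(draw_poisson(b_true, rng))
    return counts


rng = random.Random(1)
expected = [18.4, 25.6, 0.5, 3.9]    # fakes, Z -> tau tau, gamma* -> tau tau, WW
uncertainty = [1.4, 6.5, 0.2, 1.0]
counts = draw_event_counts(expected, uncertainty, rng)
```

Correlated systematic errors would require drawing the vector of expectations jointly from the full covariance matrix rather than one process at a time.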



D. Step 4: Exploration of regions 

Knowing how to calculate $p_N^R$ for a specific $N$-region
$R$ allows us to determine which of two $N$-regions is more
interesting. Specifically, an $N$-region $R_1$ is more interesting
than another $N$-region $R_2$ if $p_N^{R_1} < p_N^{R_2}$. This allows
us to compare regions of the same size (the same $N$),
although, as we will see, it does not allow us to compare
regions of different size.

Step 4 of the algorithm involves finding the most interesting
$N$-region for each fixed $N$ between 1 and $N_{\rm data}$.
This most interesting $N$-region is the one that minimizes
$p_N^R$, and these $p_N = \min_R(p_N^R)$ are needed for the next
step in the algorithm.

Even for modestly sized problems (say, two dimensions
with on the order of 100 data points), there are
far too many regions to consider an exhaustive search.
We therefore use a heuristic to find the most interesting
region. We imagine the region under consideration to be
an amoeba moving within the unit box. At each step in
the search the amoeba either expands or contracts according
to certain rules, and along the way we keep track
of the most interesting $N$-region so far found, for each
$N$. The detailed rules for this heuristic are provided in
Appendix D.

E. Steps 5 and 6: Hypothetical similar experiments, Part I

At this point in the algorithm the original events have
been reduced to $N_{\rm data}$ values, each between 0 and 1: the
$p_N$ ($N = 1, \ldots, N_{\rm data}$) corresponding to the most interesting
$N$-regions satisfying the imposed criteria. To find
the most interesting of these, we need a way of comparing
regions of different size (different $N$). An $N_1$-region
$R_{N_1}$ with $p_{N_1}^{\rm data}$ is more interesting than an $N_2$-region $R_{N_2}$
with $p_{N_2}^{\rm data}$ if the fraction of hypothetical similar experiments
in which $p_{N_1}^{\rm hse} < p_{N_1}^{\rm data}$ is less than the fraction of
hypothetical similar experiments in which $p_{N_2}^{\rm hse} < p_{N_2}^{\rm data}$.

To make this comparison, we generate $N_{{\rm hse}_1}$ hypothetical
similar experiments. Generating a hypothetical similar
experiment involves pulling a random integer from
Eq. 10 for each background process $i$, sampling this number
of events from the multidimensional background density
$b(x)$, and then transforming these events into the unit
box.

For each hse we compute a list of $p_N$, exactly as for
the data set. Each of the $N_{{\rm hse}_1}$ hypothetical similar experiments
consequently yields a list of $p_N$. For each $N$,
we now compare the $p_N$ we obtained in the data ($p_N^{\rm data}$)
with the $p_N$'s we obtained in the hse's ($p_N^{{\rm hse}_i}$, where
$i = 1, \ldots, N_{{\rm hse}_1}$). From these values we calculate $P_N$,
the fraction of hse's with $p_N^{\rm hse} < p_N^{\rm data}$:

$$ P_N = \frac{1}{N_{{\rm hse}_1}} \sum_{i=1}^{N_{{\rm hse}_1}} \Theta\left(p_N^{\rm data} - p_N^{{\rm hse}_i}\right), \tag{11} $$

where $\Theta(x) = 0$ for $x < 0$, and $\Theta(x) = 1$ for $x > 0$.

The most interesting region in the sample is then the
region for which $P_N$ is smallest. We define $P = P_{N_{\min}}$,
where $P_{N_{\min}}$ is the smallest of the $P_N$.
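The comparison across region sizes can be sketched as below. The sketch assumes that every hypothetical similar experiment supplies a $p_N$ for each $N$ that occurs in the data, and it counts only strict inequalities, since $\Theta$ is not defined at zero in the text; the function names are hypothetical.

```python
def fraction_below(data_value, hse_values):
    """Fraction of hse values strictly smaller than the data value."""
    return sum(1 for v in hse_values if v < data_value) / len(hse_values)


def most_interesting_region(p_data, p_hse):
    """Return (N_min, P) from the data list and the hse lists of p_N.

    p_data[k] is p_N in the data for N = k + 1;
    p_hse[i][k] is the same quantity in the i-th hypothetical similar experiment.
    """
    p_by_n = [fraction_below(p_data[k], [one_hse[k] for one_hse in p_hse])
              for k in range(len(p_data))]
    k_min = min(range(len(p_by_n)), key=p_by_n.__getitem__)
    return k_min + 1, p_by_n[k_min]


# Toy input: three region sizes in the data, four hypothetical similar experiments.
p_data = [0.3, 0.05, 0.4]
p_hse = [[0.2, 0.1, 0.5],
         [0.4, 0.2, 0.3],
         [0.1, 0.01, 0.6],
         [0.5, 0.3, 0.2]]
n_min, big_p = most_interesting_region(p_data, p_hse)
```

In the toy input the 2-region is selected because only one of the four experiments produces a smaller $p_2$ than the data.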

F. Step 7: Hypothetical similar experiments, Part II 

A question that remains to be answered is what fraction
$\mathcal{P}$ of hypothetical similar experiments would yield
a $P$ less than the $P$ obtained in the data. We calculate
$\mathcal{P}$ by running a second set of $N_{{\rm hse}_2}$ hypothetical similar
experiments, generated as described in the previous section.
(We have written hse$_1$ above to refer to the first set
of hypothetical similar experiments, used to determine
the $P_N$, given a list of $p_N$; we write hse$_2$ to refer to this
second set of hypothetical similar experiments, used to
determine $\mathcal{P}$ from $P$.) A second, independent set of hse's
is required to calculate an unbiased value for $\mathcal{P}$. The
quantity $\mathcal{P}$ is then given by

$$ \mathcal{P} = \frac{1}{N_{{\rm hse}_2}} \sum_{i=1}^{N_{{\rm hse}_2}} \Theta\left(P^{\rm data} - P^{{\rm hse}_i}\right). \tag{12} $$

This is the final measure of the degree of interest of the
most interesting region. Note that $\mathcal{P}$ is a number between
0 and 1, that small values of $\mathcal{P}$ indicate a sample
containing an interesting region, that large values of $\mathcal{P}$
indicate a sample containing no interesting region, and
that $\mathcal{P}$ can be described as the fraction of hypothetical
similar experiments that yield a more interesting result
than is observed in the data. $\mathcal{P}$ can be translated into
units of standard deviations ($\mathcal{P}_{[\sigma]}$) by solving the unit
conversion equation

$$ \mathcal{P} = \frac{1}{\sqrt{2\pi}} \int_{\mathcal{P}_{[\sigma]}}^{\infty} e^{-t^2/2}\, dt \tag{13} $$

for $\mathcal{P}_{[\sigma]}$.

G. Interpretation of results 

In a general search for new phenomena, Sleuth will
be applied to $N_{\rm fs}$ different final states, resulting in $N_{\rm fs}$
different values for $\mathcal{P}$. The final step in the procedure is
the combination of these results. If no $\mathcal{P}$ value is smaller
than $\approx 0.01$ then a null result has been obtained, as no
significant signal for new physics has been identified in
the data.

If one or more of the $\mathcal{P}$ values is particularly low, then
we can surmise that the region(s) of excess corresponds
either to a poorly modeled background or to possible
evidence of new physics. The algorithm has pointed out
a region of excess ($\mathcal{R}$) and has quantified its significance
($\mathcal{P}$). The next step is to interpret this result.

Two issues related to this interpretation are combining
results from many final states, and confirming a Sleuth
discovery.



1. Combining the results of many final states 

If one looks at many final states, one expects eventually
to see a fairly small $\mathcal{P}$, even if there really is no
new physics in the data. We therefore define a quantity
$\tilde{\mathcal{P}}$ to be the fraction of hypothetical similar experimental
runs¹ that yield a $\mathcal{P}$ that is smaller than the smallest $\mathcal{P}$
observed in the data. Explicitly, given $N_{\rm fs}$ final states,
with $b_i$ background events expected in each, and $\mathcal{P}_i$ calculated
for each one, $\tilde{\mathcal{P}}$ is given to good approximation
by²

$$ \tilde{\mathcal{P}} = 1 - \prod_{i=1}^{N_{\rm fs}} \sum_{j=0}^{n_i - 1} \frac{e^{-b_i}\, b_i^j}{j!}, \tag{14} $$

where $n_i$ is the smallest integer satisfying

$$ \sum_{j=n_i}^{\infty} \frac{e^{-b_i}\, b_i^j}{j!} \le \mathcal{P}_{\min} = \min_i \mathcal{P}_i. \tag{15} $$

¹ In the phrase "hypothetical similar experiment," "experiment"
refers to the analysis of a single final state. We use
"experimental runs" in a similar way to refer to the analysis
of a number of different final states. Thus a hypothetical
similar experimental run consists of $N_{\rm fs}$ different hypothetical
similar experiments, one for each final state analyzed.

2. Confirmation 

An independent confirmation is desirable for any po- 
tential discovery, especially for an excess revealed by a 
data-driven search. Such confirmation may come from 
an independent experiment, from the same experiment 
in a different but related final state, from an indepen- 
dent confirmation of the background estimate, or from 
the same experiment in the same final state using independent
data. In the last of these cases, a first sample
can be presented to Sleuth to uncover any hints of new 
physics, and the remaining sample can be subjected to a 
standard analysis in the region suggested by Sleuth. An 
excess in this region in the second sample helps to con- 
firm a discrepancy between data and background. If we 
see hints of new physics in the Run I data, for example, 
we will be able to predict where new physics might show 
itself in the upcoming run of the Fermilab Tevatron, Run 
II. 



IV. THE $e\mu X$ DATA SET

As mentioned in Sec. I, we have applied the Sleuth
method to DØ data containing one or more electrons
and one or more muons. We use a data set corresponding
to 108.3±5.7 pb$^{-1}$ of integrated luminosity, collected
between 1992 and 1996 at the Fermilab Tevatron with the
DØ detector. The data set and basic selection criteria are
identical to those used in the published $t\bar{t}$ cross section
analysis for the dilepton channels [Q. Specifically, we 
apply global cleanup cuts and select events containing 



² Note that the naive expression $\tilde{\mathcal{P}} = 1 - (1 - \mathcal{P}_{\min})^{N_{\rm fs}}$ is
not correct, since this requires $\tilde{\mathcal{P}} \to 1$ for $N_{\rm fs} \to \infty$, and
there are indeed an infinite number of final states to examine.
The resolution of this paradox hinges on the fact that only
an integral number of events can be observed in each final
state, and therefore final states with $b_i \ll 1$ contribute very
little to the value of $\tilde{\mathcal{P}}$. This is correctly accounted for in the
formulation given in Eq. 14.

Data set                  Fakes       $Z \to \tau\tau$   $\gamma^* \to \tau\tau$   $WW$          $t\bar{t}$      Total
$e\mu\not\!\!E_T$         18.4±1.4    25.6±6.5           0.5±0.2                   3.9±1.0       0.011±0.003     48.5±7.6
$e\mu\not\!\!E_T\,j$      8.7±1.0     3.0±0.8            0.1±0.03                  1.1±0.3       0.4±0.1         13.2±1.5
$e\mu\not\!\!E_T\,jj$     2.7±0.6     0.5±0.2            0.012±0.006               0.18±0.05     1.8±0.5         5.2±0.8
$e\mu\not\!\!E_T\,jjj$    0.4±0.2     0.07±0.05          0.005±0.004               0.032±0.009   0.7±0.2         1.3±0.3
$e\mu X$                  30.2±1.8    29.2±4.5           0.7±0.1                   5.2±0.8       3.1±0.5         68.3±5.7

TABLE III. The number of expected background events for the populated final states within $e\mu X$. The errors on $e\mu X$ are
smaller than on the sum of the individual background contributions obtained from Monte Carlo because of an uncertainty on
the number of extra jets arising from initial and final state radiation in the exclusive channels.



Final State                 Variables
$e\mu\not\!\!E_T$           $p_T^e$, $\not\!\!E_T$
$e\mu\not\!\!E_T\,j$        $p_T^e$, $\not\!\!E_T$, $p_T^{j}$
$e\mu\not\!\!E_T\,jj$       $p_T^e$, $\not\!\!E_T$, $p_T^{j_2}$
$e\mu\not\!\!E_T\,jjj$      $p_T^e$, $\not\!\!E_T$, $p_T^{j_2} + p_T^{j_3}$

TABLE II. The exclusive final states within $e\mu X$ for which
events are seen in the data and the variables used for each
of these final states. The variables are selected using the
prescription described in Sec. II. Although all final states
contain "$e\mu\not\!\!E_T$," no missing transverse energy cut has been
applied explicitly; $\not\!\!E_T$ is inferred from the presence of the
muon, following Sec. II B.



• one or more high $p_T$ ($p_T > 15$ GeV) isolated electrons, and

• one or more high $p_T$ ($p_T > 15$ GeV) isolated muons,



with object definitions given in Sec. [LTJ 

The dominant standard model and instrumental back- 
grounds to this data set are 

• top quark pair production with $t \to Wb$, and with
both W bosons decaying leptonically, one to $e\nu$ (or
to $\tau\nu \to e\nu\nu\nu$) and one to $\mu\nu$ (or to $\tau\nu \to \mu\nu\nu\nu$),

• W boson pair production with both W bosons decaying
leptonically, one to $e\nu$ (or to $\tau\nu \to e\nu\nu\nu$)
and one to $\mu\nu$ (or to $\tau\nu \to \mu\nu\nu\nu$),

• $Z/\gamma^* \to \tau\tau \to e\mu\nu\nu\nu\nu$, and

• instrumental ("fakes"): W production with the W
boson decaying to $\mu\nu$ and a radiated jet or photon
being mistaken for an electron, or $b\bar{b}/c\bar{c}$ production
with one heavy quark producing an isolated muon
and the other a false electron p3| . 

A sample of 100,000 $t\bar{t} \to$ dilepton events was generated
using HERWIG [16], and a WW sample of equal
size was generated using PYTHIA [17]. We generated
$\gamma^* \to \tau\tau \to e\mu\nu\nu\nu\nu$ (Drell-Yan) events using PYTHIA
and $Z \to \tau\tau \to e\mu\nu\nu\nu\nu$ events using ISAJET M . The
Drell-Yan cross section is normalized as in Ref. |n3 ]. The
cross section for $Z \to \tau\tau$ is taken to be equal to the
published DØ $Z \to ee$ cross section |2Cf| ; the top quark
production cross section is taken from Ref. pl|; and the
WW cross section is taken from Ref. p3]. The $t\bar{t}$, WW,
and $Z/\gamma^*$ Monte Carlo events all were processed through
GEANT |H and the DØ reconstruction software. The
number and distributions of events containing fake electrons
are taken from data, using a sample of events satisfying
"bad" electron identification criteria [g4[ .

We break $e\mu X$ into exclusive data sets, and determine
which variables to consider in each set using the prescription
given in Sec. II. The exclusive final states within
$e\mu X$ that are populated with events in the data are listed
in Table II. The numbers of events expected for the various
samples and data sets in the populated final states are
given in Table III; the numbers of expected background
events in all unpopulated final states in which the number
of expected background events is > 0.001 are listed
in Table IV. The dominant sources of systematic error
are given in Table V.



Final State 


Background expected 




0.3 ±0.15 




0.10 ±0.05 


e/i/i 


0.04 ± 0.02 




0.06 ± 0.03 



TABLE IV. The number of expected background events
for the unpopulated final states within $e\mu X$. The expected
number of events in final states with additional jets is obtained
from those listed in the table by dividing by five for each jet.
These are all rough estimates, and a large systematic error has
been assigned accordingly. Since no events are seen in any of
these final states, the background estimates shown here are
used solely in the calculation of $\tilde{\mathcal{P}}$ for all $e\mu X$ channels.



V. SENSITIVITY 

We choose to consider the $e\mu X$ final state first because
it contains backgrounds of mass scale comparable
to that expected of the physics responsible for electroweak
symmetry breaking. Top quark pair production
($q\bar{q} \to t\bar{t} \to W^+W^-b\bar{b}$) and W boson pair production are
excellent examples of the type of physics that we would
expect the algorithm to find.

Before examining the data, we decided to impose the
requirements of AntiCornerSphere and Isolation (see Appendix C)
on the regions that Sleuth is allowed to consider.
The reason for this choice is that, in addition to
allowing only "reasonable" regions, it allows the search
to be parameterized essentially by a single variable:
the distance between each region and the lower left-hand
corner of the unit box. We felt this would aid the interpretation
of the results from this initial application of
the method.

Source                                             Error
Trigger and lepton identification efficiencies     12%
$P(j \to$ "$e$"$)$                                 7%
Multiple interactions                              7%
Luminosity                                         5.3%
$\sigma(t\bar{t} \to e\mu X)$                      12%
$\sigma(Z \to \tau\tau \to e\mu X)$                10%
$\sigma(WW \to e\mu X)$                            10%
$\sigma(\gamma^* \to \tau\tau \to e\mu X)$         17%
Jet modeling                                       20%

TABLE V. Sources of systematic uncertainty on the number
of expected background events in the final states $e\mu\not\!\!E_T$,
$e\mu\not\!\!E_T\,j$, $e\mu\not\!\!E_T\,jj$, and $e\mu\not\!\!E_T\,jjj$. $P(j \to$ "$e$"$)$ denotes the probability
that a jet will be reconstructed as an electron. "Jet
modeling" includes systematic uncertainties in jet production
in PYTHIA and HERWIG in addition to jet identification
and energy scale uncertainties.

We test the sensitivity in two phases, keeping in mind
that nothing in the algorithm has been "tuned" to finding
WW and $t\bar{t}$ in this sample. We first consider the
background to comprise fakes and $Z/\gamma^* \to \tau\tau$ only, to
see if we can "discover" either WW or $t\bar{t}$. We then consider
the background to comprise fakes, $Z/\gamma^* \to \tau\tau$, and
WW, to see whether we can "discover" $t\bar{t}$. We apply the
full search strategy and algorithm in both cases, first (in
this section) on an ensemble of mock samples, and then
(in Sec. VI) on the data.



A. Search for WW and $t\bar{t}$ in mock samples

In this section we provide results from Sleuth for the
case in which $Z/\gamma^* \to \tau\tau$ and fakes are included in the
background estimates and the signal from WW and $t\bar{t}$ is
"unknown." We apply the prescription to the exclusive
$e\mu X$ final states listed in Table II.

Figure 4 shows distributions of $\mathcal{P}$ for mock samples
containing only $Z/\gamma^* \to \tau\tau$ and fakes, where the mock
events are pulled randomly from their parent distributions
and the numbers of events are allowed to vary
within systematic and statistical errors. The distributions
are uniform in the interval [0,1], as expected, becoming
appropriately discretized in the low statistics
limit. (When the number of expected background events
$b \lesssim 1$, as in Fig. 4(d), it can happen that zero or one
events are observed. If zero events are observed then
$\mathcal{P} = 1$, since all hypothetical similar experiments yield a
result as interesting or more interesting than an empty
sample. If one event is observed then there is only one
region for Sleuth to consider, and $\mathcal{P}$ is simply the probability
for $b \pm \delta b$ to fluctuate up to one or more events. In
Fig. 4(d), for example, the spike at $\mathcal{P} = 1$ contains 62%
of the mock experiments, since this is the probability for
0.5 ± 0.2 to fluctuate to zero events; the second spike
is located at $\mathcal{P} = 0.38$ and contains 28% of the mock
experiments, since this is the probability for 0.5 ± 0.2
to fluctuate to exactly one event. Similar but less pronounced
behavior is seen in Fig. 4(c).) Figure 5 shows
distributions of $\mathcal{P}$ when the mock samples contain WW
and $t\bar{t}$ in addition to the background in Fig. 4. Again,
the number of events from each process is allowed to vary
within statistical and systematic error. Figure 5 shows
that we can indeed find $t\bar{t}$ and/or WW much of the time.
Figure 6 shows $\tilde{\mathcal{P}}$ computed for these samples. In over
50% of these samples we find $\tilde{\mathcal{P}}_{[\sigma]}$ to correspond to more
than two standard deviations.

FIG. 4. Distributions of $\mathcal{P}$ for the four exclusive final states
(a) $e\mu\not\!\!E_T$, (b) $e\mu\not\!\!E_T\,j$, (c) $e\mu\not\!\!E_T\,jj$, and (d) $e\mu\not\!\!E_T\,jjj$. The
background includes only $Z/\gamma^* \to \tau\tau$ and fakes, and the mock
samples making up these distributions also contain only these
two sources. As expected, $\mathcal{P}$ is uniform in the interval [0, 1] for
those final states in which the expected number of background
events $b \gg 1$, and shows discrete behavior for $b \lesssim 1$.
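The heights and positions of the spikes quoted for Fig. 4(d) can be checked with a short calculation. The sketch assumes the expected background is Gaussian distributed with mean 0.5 and width 0.2 and ignores the small part of that Gaussian below zero, in which case the Poisson probabilities average to closed forms; the function names are hypothetical.

```python
import math


def smeared_poisson_zero(mu, sigma):
    """E[exp(-b)] for b ~ Normal(mu, sigma): probability of zero events."""
    return math.exp(-mu + 0.5 * sigma ** 2)


def smeared_poisson_one(mu, sigma):
    """E[b * exp(-b)] for b ~ Normal(mu, sigma): probability of exactly one event."""
    return (mu - sigma ** 2) * math.exp(-mu + 0.5 * sigma ** 2)


p_zero = smeared_poisson_zero(0.5, 0.2)   # fraction of samples with no events
p_one = smeared_poisson_one(0.5, 0.2)     # fraction of samples with one event
p_one_or_more = 1.0 - p_zero              # position of the second spike
```

Under these assumptions the three numbers come out near 0.62, 0.28 and 0.38, in agreement with the spike contents and the spike position given in the text.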



B. Search for $t\bar{t}$ in mock samples

In this section we provide results for the case in which
$Z/\gamma^* \to \tau\tau$, fakes, and WW are all included in the background
estimate, and $t\bar{t}$ is the "unknown" signal. We
again apply the prescription to the exclusive final states
listed in Table II.

Figure 7 shows distributions of $\mathcal{P}$ for mock samples
containing $Z/\gamma^* \to \tau\tau$, fakes, and WW, where the mock
events are pulled randomly from their parent distributions,
and the numbers of events are allowed to vary
within systematic and statistical errors. As found in
the previous section, the distributions are uniform in the
interval [0,1], becoming appropriately discretized when
the expected number of background events becomes $\lesssim 1$.
Figure 8 shows distributions of $\mathcal{P}$ when the mock samples
contain $t\bar{t}$ in addition to $Z/\gamma^* \to \tau\tau$, fakes, and
WW. Again, the number of events from each process is
allowed to vary within statistical and systematic errors.
The distributions in Figs. 8(c) and (d) show that we can
indeed find $t\bar{t}$ much of the time. Figure 9 shows that the
distribution of $\tilde{\mathcal{P}}_{[\sigma]}$ is approximately a Gaussian centered
at zero of width unity for the case where the background
and data both contain $Z/\gamma^* \to \tau\tau$, fakes, and WW production,
and is peaked in the bin above 2.0 for the same
background when the data include $t\bar{t}$.

FIG. 5. Distributions of $\mathcal{P}$ for the four exclusive final states
(a) $e\mu\not\!\!E_T$, (b) $e\mu\not\!\!E_T\,j$, (c) $e\mu\not\!\!E_T\,jj$, and (d) $e\mu\not\!\!E_T\,jjj$. The
background includes only $Z/\gamma^* \to \tau\tau$ and fakes. The mock
samples for these distributions contain WW and $t\bar{t}$ in addition
to $Z/\gamma^* \to \tau\tau$ and fakes. The extent to which these
distributions peak at small $\mathcal{P}$ can be taken as a measure of
Sleuth's ability to find WW or $t\bar{t}$ if we had no knowledge of
either final state. The presence of WW in $e\mu\not\!\!E_T$ causes the
trend toward small values in (a); the presence of $t\bar{t}$ causes the
trend toward small values in (c) and (d); and a combination
of WW and $t\bar{t}$ causes the signal seen in (b).

FIG. 6. Distribution of $\tilde{\mathcal{P}}_{[\sigma]}$ from combining the four exclusive
final states $e\mu\not\!\!E_T$, $e\mu\not\!\!E_T\,j$, $e\mu\not\!\!E_T\,jj$, and $e\mu\not\!\!E_T\,jjj$. The
background includes only $Z/\gamma^* \to \tau\tau$ and fakes. The mock
samples making up the distribution shown as the solid line
contain WW and $t\bar{t}$ in addition to $Z/\gamma^* \to \tau\tau$ and fakes, and
correspond to Fig. 5; the mock samples making up the distribution
shown as the dashed line contain only $Z/\gamma^* \to \tau\tau$ and
fakes, and correspond to Fig. 4. All samples with $\tilde{\mathcal{P}}_{[\sigma]} > 2.0$
appear in the rightmost bin. The fact that $\tilde{\mathcal{P}}_{[\sigma]} > 2.0$ in 50%
of the mock samples can be taken as a measure of Sleuth's
sensitivity to finding WW and $t\bar{t}$ if we had no knowledge of
the existence of the top quark or the possibility of W boson
pair production.



C. New high $p_T$ physics

We have shown in Secs. V A and V B that the Sleuth
prescription and algorithm correctly find nothing when
there is nothing to be found, while exhibiting sensitivity
to the expected presence of WW and $t\bar{t}$ in the $e\mu X$ sample.
Sleuth's performance on this "typical" new physics
signal is encouraging, and may be taken as some measure
of the sensitivity of this method to the great variety of
new high $p_T$ physics that it has been designed to find.
Making a more general claim regarding Sleuth's sensitivity
to the presence of new physics is difficult, since
the sensitivity obviously varies with the characteristics
of each candidate theory.

That being said, we can provide a rough estimate of
Sleuth's sensitivity to new high $p_T$ physics with the following
argument. We have seen that we are sensitive
to WW and $t\bar{t}$ pair production in a data sample corresponding
to an integrated luminosity of $\approx 100$ pb$^{-1}$.
These events tend to fall in the region $p_T^e > 40$ GeV,
$\not\!\!E_T > 40$ GeV, and $\sum' p_T^j > 40$ GeV (if there are any
jets at all). The probability that any true $e\mu X$ event produced
will make it into the final sample is about 15% due
to the absence of complete hermeticity of the DØ detector,
inefficiencies in the detection of electrons and muons,
and kinematic acceptance. We can therefore state that
we are as sensitive to new high $p_T$ physics as we were to
the roughly eight WW and $t\bar{t}$ events in our mock samples
if the new physics is distributed relative to all standard
model backgrounds as WW and $t\bar{t}$ are distributed relative
to backgrounds from $Z/\gamma^* \to \tau\tau$ and fakes alone,
and if its production cross section × branching ratio into
this final state is > 8/(0.15 × 100 pb$^{-1}$) $\approx$ 600 fb. Readers
who are interested in a possible signal with a different
relative distribution, or who prefer a more rigorous definition
of "sensitivity," should adjust this cross section
accordingly.

FIG. 7. Distributions of $\mathcal{P}$ for the four exclusive final states
(a) $e\mu\not\!\!E_T$, (b) $e\mu\not\!\!E_T\,j$, (c) $e\mu\not\!\!E_T\,jj$, and (d) $e\mu\not\!\!E_T\,jjj$. The
background includes $Z/\gamma^* \to \tau\tau$, fakes, and WW, and the
mock samples making up these distributions also contain
these three sources. As expected, $\mathcal{P}$ is uniform in the interval
[0, 1] for those final states in which the expected number of
background events $b \gg 1$, and shows discrete behavior when
$b \lesssim 1$.
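The arithmetic behind the quoted cross section can be made explicit. The sketch takes the three inputs stated in the text (about eight events, about 15% acceptance, about 100 pb$^{-1}$); the function name is hypothetical.

```python
def minimum_cross_section_pb(n_events, acceptance, luminosity_pb_inv):
    """Cross section times branching ratio (in pb) that yields n_events
    after acceptance in the given integrated luminosity."""
    return n_events / (acceptance * luminosity_pb_inv)


sigma_pb = minimum_cross_section_pb(8, 0.15, 100.0)
sigma_fb = 1000.0 * sigma_pb  # 1 pb = 1000 fb
```

With these rounded inputs the ratio evaluates to roughly 0.53 pb, the same order as the rough figure of 600 fb quoted above; the inputs themselves are approximate, so the result should be read as an order-of-magnitude guide.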



VI. RESULTS 

In the previous section we studied what can be expected
when Sleuth is applied to $e\mu X$ mock samples. In
this section we confront Sleuth with data. We observe
39 events in the $e\mu\not\!\!E_T$ final state, 13 events in $e\mu\not\!\!E_T\,j$,
5 events in $e\mu\not\!\!E_T\,jj$, and a single event in $e\mu\not\!\!E_T\,jjj$, in
good agreement with the expected background in Table III.
We proceed by first removing both WW and $t\bar{t}$
from the background estimates, and next by removing
only $t\bar{t}$, to search for evidence of these processes in the
data. Finally, we include all standard model processes in
the background estimates and search for evidence of new
physics.

FIG. 8. Distributions of $\mathcal{P}$ for the four exclusive final states
(a) $e\mu\not\!\!E_T$, (b) $e\mu\not\!\!E_T\,j$, (c) $e\mu\not\!\!E_T\,jj$, and (d) $e\mu\not\!\!E_T\,jjj$. The
background includes $Z/\gamma^* \to \tau\tau$, fakes, and WW. The
mock samples for these distributions contain $t\bar{t}$ in addition
to $Z/\gamma^* \to \tau\tau$, fakes, and WW. The extent to which these
distributions peak at small $\mathcal{P}$ can be taken as a measure of
Sleuth's sensitivity to finding $t\bar{t}$ if we had no knowledge of the
top quark's existence or characteristics. Note that $\mathcal{P}$ is flat
in $e\mu\not\!\!E_T$, where the expected number of top quark events is
negligible, peaks slightly toward small values in $e\mu\not\!\!E_T\,j$, and
shows a marked low peak in $e\mu\not\!\!E_T\,jj$ and $e\mu\not\!\!E_T\,jjj$.



A. Search for $WW$ and $t\bar{t}$ in data

The results of applying Sleuth to DØ data with only
$Z/\gamma^* \to \tau\tau$ and fakes in the background estimate are shown
in Table VI and Fig. 10. Sleuth finds indications of an excess in the
$e\mu\not\!\!E_T$ and $e\mu\not\!\!E_T jj$ states, presumably reflecting
the presence of $WW$ and $t\bar{t}$, respectively. The results for the
$e\mu\not\!\!E_T j$ and $e\mu\not\!\!E_T jjj$ final states are consistent
with the results in Fig. 5. Defining $r'$ as the distance of the data
point from (0, 0, 0) in the unit box (transformed so that the background
is distributed uniformly in the interval [0, 1]), the top candidate
events from DØ's recent analysis [2?] are the three events with largest
$r'$ in the $e\mu\not\!\!E_T jj$ sample and the single event in the
$e\mu\not\!\!E_T jjj$ sample, shown in Fig. 10. The presence of the $WW$
signal can be inferred from the events designated interesting in the
$e\mu\not\!\!E_T$ final state.

FIG. 9. Distribution of $\tilde{\mathcal{P}}_{[\sigma]}$ from combining
the four exclusive final states $e\mu\not\!\!E_T$, $e\mu\not\!\!E_T j$,
$e\mu\not\!\!E_T jj$, and $e\mu\not\!\!E_T jjj$. The background includes
$Z/\gamma^* \to \tau\tau$, fakes, and $WW$. The mock samples making up the
distribution shown as the solid line contain $t\bar{t}$ in addition to
$Z/\gamma^* \to \tau\tau$, fakes, and $WW$, corresponding to Fig. 8; the
mock samples making up the distribution shown as the dashed line contain
only $Z/\gamma^* \to \tau\tau$, fakes, and $WW$, and correspond to
Fig. [?]. All samples with $\tilde{\mathcal{P}}_{[\sigma]} > 2.0$ appear
in the rightmost bin. The fact that $\tilde{\mathcal{P}}_{[\sigma]} > 2.0$
in over 25% of the mock samples can be taken as a measure of Sleuth's
sensitivity to finding $t\bar{t}$ if we had no knowledge of the top
quark's existence or characteristics.

Data set                    $\mathcal{P}$
$e\mu\not\!\!E_T$           0.008
$e\mu\not\!\!E_T j$         0.34
$e\mu\not\!\!E_T jj$        0.01
$e\mu\not\!\!E_T jjj$       0.38
$\tilde{\mathcal{P}}$       0.03

TABLE VI. Summary of results on the $e\mu\not\!\!E_T$,
$e\mu\not\!\!E_T j$, $e\mu\not\!\!E_T jj$, and $e\mu\not\!\!E_T jjj$
channels when $WW$ and $t\bar{t}$ are not included in the background.
Sleuth identifies a region of excess in the $e\mu\not\!\!E_T$ and
$e\mu\not\!\!E_T jj$ final states, presumably indicating the presence of
$WW$ and $t\bar{t}$ in the data. In units of standard deviation,
$\tilde{\mathcal{P}}_{[\sigma]} = 1.9$.
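
The correspondence between a probability and its value "in units of standard deviation" can be sketched in a few lines. The snippet below assumes the one-sided Gaussian convention, which is an assumption on our part but reproduces the values quoted in this section (0.03 corresponds to 1.9, 0.11 to 1.2); the function name `p_to_sigma` is hypothetical.

```python
from statistics import NormalDist


def p_to_sigma(p):
    """Return z such that a unit Gaussian has upper-tail probability p.

    Assumption: one-sided convention, P(Z >= z) = p.
    """
    return NormalDist().inv_cdf(1.0 - p)


if __name__ == "__main__":
    # Probabilities quoted in the text for the combined and single-state results.
    for p in (0.03, 0.11, 0.007):
        print(f"P = {p:<6} -> {p_to_sigma(p):.1f} standard deviations")
```

Under this convention a probability above 0.5 maps to a negative number of standard deviations, which simply signals that no excess is present.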



B. Search for $t\bar{t}$ in data

The results of applying Sleuth to the data with $Z/\gamma^* \to \tau\tau$,
fakes, and $WW$ included in the background estimate are shown in
Table VII and Fig. 11. Sleuth finds an indication of excess in the
$e\mu\not\!\!E_T jj$ events, presumably indicating the presence of
$t\bar{t}$. The results for the $e\mu\not\!\!E_T$, $e\mu\not\!\!E_T j$,
and $e\mu\not\!\!E_T jjj$ final states are consistent with the results in
Fig. [?]. The $t\bar{t}$ candidates from DØ's recent analysis [2?] are the
three events with largest $r'$ in the $e\mu\not\!\!E_T jj$ sample, and the
single event in the $e\mu\not\!\!E_T jjj$ sample, shown in Fig. 11.

A comparison of this result with one obtained using a dedicated top quark
search illustrates an important difference between Sleuth's result and
the result from a dedicated search. DØ announced its discovery of the top
quark [?6] in 1995 with 50 pb$^{-1}$ of integrated luminosity upon
observing 17 events with an expected background of $3.8 \pm 0.6$ events,
a $4.6\sigma$ "effect," in the combined dilepton and single-lepton decay
channels. In the $e\mu$ channel alone, two events were seen with an
expected background of $0.12 \pm 0.03$ events. The probability of
$0.12 \pm 0.03$ events fluctuating up to or above two events is 0.007,
corresponding to a $2.5\sigma$ "effect." In a subsequent measurement of
the top quark cross section [12], three candidate events were seen with
an expected background of $0.21 \pm 0.16$, an excess corresponding to a
$2.75\sigma$ "effect." Using Sleuth, we find $\mathcal{P} = 0.03$ in the
$e\mu\not\!\!E_T jj$ sample, a $1.9\sigma$ "effect," when complete
ignorance of the top quark is feigned. When we take into account the fact
that we have also searched in all of the final states listed in Table IV,
we find $\tilde{\mathcal{P}} = 0.11$, a $1.2\sigma$ "effect." The
difference between the $2.75\sigma$ "effect" seen with a dedicated top
quark search and the $1.2\sigma$ "effect" that Sleuth reports in $e\mu X$
lies partially in the fact that Sleuth is not optimized for $t\bar{t}$,
and partially in the careful accounting of the many new physics
signatures that Sleuth considered in addition to $t\bar{t}$ production,
and the correspondingly many new physics signals that Sleuth might have
discovered.

FIG. 10. Positions of data points following the transformation of the
background from fake and $Z/\gamma^*$ sources in the space of variables in
Table I to a uniform distribution in the unit box. The darkened points
define the region Sleuth found most interesting. The axes of the unit box
in (a) are suggestively labeled ($p_T^e$) and ($\not\!\!E_T$); each is a
function of both $p_T^e$ and $\not\!\!E_T$, but ($p_T^e$) depends more
strongly on $p_T^e$, while ($\not\!\!E_T$) more closely tracks
$\not\!\!E_T$. $r'$ is the distance of the data point from (0, 0, 0) (the
"lower left-hand corner" of the unit box), transformed so that the
background is distributed uniformly in the interval [0, 1]. The
interesting regions in the $e\mu\not\!\!E_T$ and $e\mu\not\!\!E_T jj$
samples presumably indicate the presence of $WW$ signal in
$e\mu\not\!\!E_T$ and of $t\bar{t}$ signal in $e\mu\not\!\!E_T jj$. We
find $\tilde{\mathcal{P}} = 0.03$ ($\tilde{\mathcal{P}}_{[\sigma]} = 1.9$).

Data set                    $\mathcal{P}$
$e\mu\not\!\!E_T$           0.16
$e\mu\not\!\!E_T j$         0.45
$e\mu\not\!\!E_T jj$        0.03
$e\mu\not\!\!E_T jjj$       0.41
$\tilde{\mathcal{P}}$       0.11

TABLE VII. Summary of results on the $e\mu\not\!\!E_T$,
$e\mu\not\!\!E_T j$, $e\mu\not\!\!E_T jj$, and $e\mu\not\!\!E_T jjj$
channels when $t\bar{t}$ production is not included in the background.
Sleuth identifies a region of excess in the $e\mu\not\!\!E_T jj$ final
state, presumably indicating the presence of $t\bar{t}$ in the data. In
units of standard deviation, $\tilde{\mathcal{P}}_{[\sigma]} = 1.2$.

FIG. 11. Positions of data points following the transformation of the
background from the three sources $Z/\gamma^* \to \tau\tau$, fakes, and
$WW$ in the space of variables in Table I to a uniform distribution in
the unit box. The darkened points define the region Sleuth found most
interesting. The interesting region in the $e\mu\not\!\!E_T jj$ sample
presumably indicates the presence of $t\bar{t}$. We find
$\tilde{\mathcal{P}} = 0.11$ ($\tilde{\mathcal{P}}_{[\sigma]} = 1.2$).



C. Search for physics beyond the standard model

In this section we present Sleuth's results for the case in which all
standard model and instrumental backgrounds are considered in the
background estimate: $Z/\gamma^* \to \tau\tau$, fakes, $WW$, and
$t\bar{t}$. The results are shown in Table VIII and Fig. 12. We observe
excellent agreement with the standard model. We conclude that these data
contain no evidence of new physics at high $p_T$, and calculate that a
fraction $\tilde{\mathcal{P}} = 0.72$ of hypothetical similar
experimental runs would produce a more significant excess than any
observed in these data. Recall that we are sensitive to new high $p_T$
physics with production cross section $\times$ branching ratio into this
final state as described in Sec. V C.

Data set                    $\mathcal{P}$
$e\mu\not\!\!E_T$           0.14
$e\mu\not\!\!E_T j$         0.45
$e\mu\not\!\!E_T jj$        0.31
$e\mu\not\!\!E_T jjj$       0.71
$\tilde{\mathcal{P}}$       0.72

TABLE VIII. Summary of results on all final states within $e\mu X$ when
all standard model backgrounds are included. The unpopulated final states
(listed in Table IV) have $\mathcal{P} = 1.0$; these final states are
included in the calculation of $\tilde{\mathcal{P}}$. We observe no
evidence for the presence of new high $p_T$ physics.
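
The single-channel significance quoted in Sec. VI B for the original top quark observation in the $e\mu$ channel (two events on an expected background of 0.12) can be checked with a short calculation. The sketch below is illustrative only: it treats the background as an exact Poisson mean, ignoring the quoted $\pm 0.03$ uncertainty, and it assumes a one-sided Gaussian conversion to standard deviations. The function name `poisson_tail` is hypothetical.

```python
import math
from statistics import NormalDist


def poisson_tail(n_obs, b):
    """P(X >= n_obs) for X ~ Poisson(b), via 1 minus the lower partial sum."""
    term = math.exp(-b)          # P(X = 0)
    lower = 0.0
    for k in range(n_obs):
        lower += term            # accumulate P(X = k)
        term *= b / (k + 1)      # advance to P(X = k + 1)
    return max(0.0, 1.0 - lower)


if __name__ == "__main__":
    p = poisson_tail(2, 0.12)    # two events seen, 0.12 expected
    z = NormalDist().inv_cdf(1.0 - p)
    print(f"P(>= 2 | b = 0.12) = {p:.4f}, about {z:.1f} standard deviations")
```

Neglecting the background uncertainty gives a probability slightly below the quoted 0.007; folding in the uncertainty would raise it somewhat.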



FIG. 12. Positions of the data points following the transformation of
the background from $Z/\gamma^* \to \tau\tau$, fakes, $WW$, and $t\bar{t}$
sources in the space of variables in Table I to a uniform distribution in
the unit box. The darkened points define the region that Sleuth chose. We
find $\tilde{\mathcal{P}} = 0.72$, and distributions that are all roughly
uniform and consistent with background. No evidence for new high $p_T$
physics is observed.



VII. CONCLUSIONS

We have developed a quasi-model-independent technique for searching for
the physics responsible for stabilizing electroweak symmetry breaking.
Our prescription involves the definition of final states and the
construction of a rule that identifies a set of relevant variables for
any particular final state. An algorithm (Sleuth) systematically searches
for regions of excess in those variables, and quantifies the significance
of any observed excess. This technique is sufficiently a priori that it
allows an ex post facto, quantitative measure of the degree to which
curious events are interesting. After demonstrating the sensitivity of
the method, we have applied it to the set of events in the semi-inclusive
channel $e\mu X$. Removing $WW$ and $t\bar{t}$ from the calculated
background, we find indications of these signals in the data. Including
these background channels, we find that these data contain no evidence of
new physics at high $p_T$. A fraction $\tilde{\mathcal{P}} = 0.72$ of
hypothetical similar experimental runs would produce a more significant
excess than any observed in these data.



APPENDIX A: FURTHER COMMENTS ON
VARIABLES

We have excluded a number of "standard" variables 
from the list in Table I for various reasons: some are helpful
for specific models but not helpful in general; some
are partially redundant with variables already on the list; 
some we have omitted because we felt they were less well- 
motivated than the variables on the list, and we wish to 
keep the list of variables short. Two of the perhaps most 
significant omissions are invariant masses and topological 
variables. 

• Invariant masses: If a particle of mass $m$ is produced and its decay
products are known, then the invariant mass of those decay products is
an obvious variable to consider. $M_T^{\ell\nu}$ and $M_{\ell^+\ell^-}$
are used in this spirit to identify $W$ and $Z$ bosons, respectively, as
described in Sec. II. Unfortunately, a non-standard-model particle's
decay products are generally not known, both because the particle itself
is not known and because of final state combinatorics, and resolution
effects can wash out a mass peak unless one knows where to look.
Invariant masses turn out to be remarkably ineffective for the type of
general search we wish to perform. For example, a natural invariant mass
to consider in $e\mu\not\!\!E_T jj$ is the invariant mass of the two jets
($m_{jj}$); since top quark events do not cluster in this variable, they
would not be discovered by its use. A search for any particular new
particle with known decay products is best done with a dedicated
analysis. For these reasons the list of variables in Table I does not
include invariant masses.

• Shape variables: Thrust, sphericity, aplanarity, 
centrality, and other topological variables of- 
ten prove to be good choices for model-specific 
searches, but new physics could appear in a variety 
of topologies. Many of the processes that could 
show up in these variables already populate the 
tails of the variables in Table I. If a shape variable
is included, the choice of that particular variable
must be justified. We choose not to use topological
variables, but we do require physics objects to be
central (e.g., $|\eta_j| < 2.5$), to similar effect.



APPENDIX B: TRANSFORMATION OF 
VARIABLES 

The details of the variable transformation are most easily understood in
one dimension, and for this we can consider again Fig. 1. It is easy to
show that if the background distribution is described by the curve
$b(x) = \frac{1}{5} e^{-x/5}$ and we let $y = 1 - e^{-x/5}$, then $y$ is
distributed uniformly between 0 and 1. The situation is more complicated
when the background is given to us as a set of Monte Carlo points that
cannot be described by a simple parameterization, and it is further
complicated when these points live in several dimensions.
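
The one-dimensional statement above can be verified numerically. The following sketch draws values from an exponential background with mean 5, applies the cumulative-distribution map $y = 1 - e^{-x/5}$, and histograms the result; the sample size, the seed, and the function name `to_unit_interval` are arbitrary choices made for illustration.

```python
import math
import random


def to_unit_interval(x, scale=5.0):
    """Map x >= 0 to y in [0, 1) using the CDF of b(x) = exp(-x/scale)/scale."""
    return 1.0 - math.exp(-x / scale)


def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.expovariate(1.0 / 5.0) for _ in range(100_000)]
    ys = [to_unit_interval(x) for x in xs]
    # Each of the ten bins should hold roughly 10% of the sample.
    print(histogram(ys))
```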

There is a unique solution to this problem in one di- 
mension, but an infinity of solutions in two or more di- 
mensions. Not all of these solutions are equally reason- 
able, however — there are two additional properties that 
the solution should have. 

• Axes should map to axes. If the data live in a three- 
dimensional space in the octant with all coordinates 
positive, for example, then it is natural to map the 
coordinate axes to the axes of the box. 

• Points that are near each other should map to 
points that are near each other, subject to the con- 
straint that the resulting background probability 
distribution be flat within the unit box. 

This somewhat abstract and not entirely well-posed 
problem is helped by considering an analogous physical 
problem: 

The height of the sand in a d-dimensional unit 
sandbox is given by the function b(x), where 
x is a d-component vector. (The counting of 
dimensions is such that a physical sandbox 
has d = 2.) We take the d-dimensional lid 
of the sandbox and squash the sand flat. The 
result of this squashing is that a sand grain at 
position x has moved to a new position y, and 
the new function b'(y) describing the height 
of the sand is a constant. Given the function 
b(x), determine the mapping $x \to y$.

For this analogy to help, the background first needs 
to be put "in the sandbox." Each of the background 
events must also have the same weight (the reason for this 
will become clear shortly). The background probability 
density is therefore estimated in the original variables 
using Probability Density Estimation p^J , and M events 
are sampled from this distribution. 

These $M$ events are then put "into the sandbox" by transforming each
variable (individually) into the interval [0, 1]. The new variable is
given by

$$x_j \to \frac{1}{M} \sum_{i=1}^{M} \int_{-\infty}^{x_j} \frac{1}{\sqrt{2\pi}\,\sigma_j h} \exp\left(-\frac{(t - \mu_{ij})^2}{2\sigma_j^2 h^2}\right) dt, \tag{B1}$$

where $\mu_{ij}$ is the value of the $j$th variable for the $i$th
background event, $\sigma_j$ is the standard deviation of the
distribution in the $j$th variable, and $h = M^{-1/(d+4)}$, where $d$ is
the dimensionality of the space.
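
A minimal sketch of this per-variable transformation follows. It averages Gaussian kernel cumulative distributions centred on the sampled background values, so that each variable lands in [0, 1]. The smoothing parameter $h = M^{-1/(d+4)}$ used here is a common kernel-estimation choice and should be read as an assumption of the sketch; the function names are hypothetical.

```python
import math
from statistics import pstdev


def normal_cdf(z):
    """Cumulative distribution function of a unit Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sandbox_coordinate(x, mu_j, d):
    """Transform one variable into [0, 1].

    x    : value of the j-th variable to transform
    mu_j : values of the j-th variable for the M sampled background events
    d    : dimensionality of the variable space
    """
    m = len(mu_j)
    sigma_j = pstdev(mu_j)              # spread of the background in variable j
    h = m ** (-1.0 / (d + 4))           # assumed smoothing parameter
    width = sigma_j * h
    return sum(normal_cdf((x - mu) / width) for mu in mu_j) / m


if __name__ == "__main__":
    background = [-2.0, -1.0, 0.0, 1.0, 2.0]
    for x in (-5.0, 0.0, 5.0):
        print(x, round(sandbox_coordinate(x, background, d=2), 4))
```

Because the map is an average of cumulative distributions, it is monotonic in x, and a sample symmetric about zero sends x = 0 to 0.5.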



The next step is to take these M events and map each 
of them to a point on a uniform grid within the box. 
The previous paragraph defines a mapping from the orig- 
inal variables into the unit sandbox; this step defines a 
mapping from a lumpy distribution in the sandbox to a 
flat distribution. The mapping is continued to the entire 
space by interpolating between the sampled background 
events. 

The mapping to the grid is done by first assigning each sampled
background point to an arbitrary grid point. Each background point $i$ is
some distance $d_{ij}$ away from the grid point $j$ with which it is
paired. We then loop over pairs of background points $i$ and $i'$, which
are associated with grid points $j$ and $j'$, and swap the associations
(associate $i$ with $j'$ and $i'$ with $j$) if
$\max(d_{ij}, d_{i'j'}) > \max(d_{i'j}, d_{ij'})$. This looping and
swapping is continued until an equilibrium state is reached.
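
The pairing procedure can be sketched as follows. This is an illustrative implementation under simple assumptions: Euclidean distances, an initial pairing of point k with grid point k, and repeated full passes over all pairs until a pass produces no swap. The function name `match_to_grid` is hypothetical.

```python
import math


def match_to_grid(points, grid):
    """Pair each background point with a grid point by pairwise swapping.

    Returns a list `assign` where assign[i] is the grid index paired with
    point i. Two points exchange grid points whenever doing so reduces the
    larger of their two distances.
    """
    n = len(points)
    assign = list(range(n))                     # arbitrary initial pairing

    def dist(i, j):
        return math.dist(points[i], grid[j])

    swapped = True
    while swapped:                              # repeat until equilibrium
        swapped = False
        for a in range(n):
            for b in range(a + 1, n):
                ja, jb = assign[a], assign[b]
                if max(dist(a, ja), dist(b, jb)) > max(dist(a, jb), dist(b, ja)):
                    assign[a], assign[b] = jb, ja
                    swapped = True
    return assign


if __name__ == "__main__":
    pts = [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]
    grid = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
    print(match_to_grid(pts, grid))
```

Each swap strictly lowers the larger distance of the pair involved, so the loop cannot revisit an earlier pairing and must terminate.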



APPENDIX C: REGION CRITERIA 



In Sec. III B 3 we introduced the formal notion of region criteria:
properties that we require a region to have for it to be considered by
Sleuth. The two criteria that we have decided to impose in the analysis
of the $e\mu X$ data are Isolation and AntiCornerSphere.

a. Isolation We want the region to include events that are very close to
it. We define $\xi = \frac{1}{4} N_{\text{data}}^{-1/d}$ as a measure of
the mean distance between data points in their transformed coordinates,
and call a region isolated if there exist no data points outside the
region that are closer than $\xi$ to a data point inside the region. We
generalize this boolean criterion to the interval [0, 1] by defining

$$c_R^{\text{isolation}} = \min\left(1,\ \frac{\min\left|(\vec{x})^{\text{in}} - (\vec{x})^{\text{out}}\right|}{2\xi}\right), \tag{C1}$$

where the minimum is taken over all pairwise combinations of data points
with $(\vec{x})^{\text{in}}$ inside $R$ and $(\vec{x})^{\text{out}}$
outside $R$.
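
As an illustration, the generalized Isolation criterion might be computed as below. The sketch takes the scale $\xi$ as an explicit argument rather than computing it, caps the result at 1, and treats a region with no points inside or no points outside as fully isolated; that last convention and the function name `isolation` are assumptions.

```python
import math


def isolation(inside, outside, xi):
    """Isolation score in [0, 1] for a region.

    inside  : transformed coordinates of data points in the region
    outside : transformed coordinates of data points not in the region
    xi      : mean-spacing scale defined in the text
    """
    if not inside or not outside:
        return 1.0                       # assumed convention: nothing to be close to
    gap = min(math.dist(p, q) for p in inside for q in outside)
    return min(1.0, gap / (2.0 * xi))


if __name__ == "__main__":
    print(isolation([(0.9, 0.9)], [(0.1, 0.1)], xi=0.1))   # well separated
    print(isolation([(0.5, 0.5)], [(0.5, 0.6)], xi=0.1))   # close neighbour
```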

b. AntiCornerSphere One must be able to draw a 
sphere centered on the origin of the unit box containing 
all data events outside the region and no data events 
inside the region. This is useful if the signal is expected 
to lie in the upper right-hand corner of the unit box. We 
generalize this boolean criterion to the interval [0, 1] as
described in Sec. III B 3.
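
The boolean form of AntiCornerSphere reduces to a comparison of distances from the origin of the unit box, as in the sketch below. Only the boolean form is shown; the handling of empty point sets and the function name `anti_corner_sphere` are assumptions.

```python
import math


def anti_corner_sphere(inside, outside):
    """True if a sphere centred on the origin can contain every outside
    point while excluding every inside point."""
    if not inside or not outside:
        return True                      # assumed convention for empty sets
    origin = (0.0,) * len(inside[0])
    farthest_outside = max(math.dist(origin, q) for q in outside)
    nearest_inside = min(math.dist(origin, p) for p in inside)
    return farthest_outside < nearest_inside


if __name__ == "__main__":
    print(anti_corner_sphere([(0.9, 0.8)], [(0.1, 0.2), (0.3, 0.3)]))
    print(anti_corner_sphere([(0.2, 0.2)], [(0.9, 0.9)]))
```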



A number of other potentially useful region criteria may 
be imagined. Among those that we have considered 
are Connectivity, Convexity, Peg, and Hyperplanes. Al- 
though we present only the boolean forms of these crite- 
ria here, they may be generalized to the interval [0, 1] by 
introducing the scale $\xi$ in the same spirit as above.

c. Connectivity We generally expect a discovery region to be one
connected subspace in the variables we use, rather than several
disconnected subspaces. Although one can posit cases in which the signal
region is not connected (perhaps signal appears in the two regions
$\eta > 2$ and $\eta < -2$), one should be able to easily avoid this with
an appropriate choice of variables. (In this example, we should use
$|\eta|$ rather than $\eta$.) We defined the concept of neighboring data
points in the discussion of regions in Sec. III B 2. A connected region
is defined to be a region in which given any two points $a$ and $b$
within the region, there exists a list of points
$p_1 = a, p_2, \ldots, p_{n-1}, p_n = b$ such that all the $p_i$ are in
the region and $p_{i+1}$ is a neighbor of $p_i$.
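
Given a neighbor relation, the Connectivity criterion is an ordinary graph-connectivity test, which a breadth-first search settles. In the sketch below the neighbor relation is supplied as a dictionary mapping each point index to its neighbors; how neighbors are determined is left to the definition in the main text, and the function name `is_connected` is hypothetical.

```python
from collections import deque


def is_connected(region, neighbors):
    """True if every point in `region` can be reached from every other
    by stepping between neighboring points that all lie in the region."""
    region = set(region)
    if not region:
        return True
    start = next(iter(region))
    seen = {start}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for q in neighbors[p]:
            if q in region and q not in seen:
                seen.add(q)
                queue.append(q)
    return seen == region


if __name__ == "__main__":
    chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(is_connected({0, 1, 2}, chain))   # contiguous points
    print(is_connected({0, 2}, chain))      # gap at point 1
```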

d. Convexity We define a non-convex region as a region defined by a set
of $N$ data points $P$, such that there exists a data point $p$ not
within $P$ satisfying

$$\sum_{i=1}^{N} p_i \lambda_i = p \tag{C2}$$

$$\sum_{i} \lambda_i = 1 \tag{C3}$$

$$\lambda_i \geq 0 \quad \forall\, i \tag{C4}$$

for suitably chosen $\lambda_i$, where $p_i$ are the points within $P$.
A convex region is then any region that is not non-convex; intuitively a
convex region is one that is "roundish," without protrusions or
intrusions.

e. Peg We may want to consider only regions that live on the high tails
of a distribution. More generally, we may want to only consider regions
that contain one or more of $n$ specific points in variable space. Call
this set of points $s_i$, where $i = 1, \ldots, n$. We transform these
points exactly as we transformed the data in Sec. III B to obtain a set
of points $y_i$ that live in the unit box. A region $R$ is said to be
pegged to these points if there exists at least one
$i \in \{1, \ldots, n\}$ such that the closest data point to $y_i$ lies
within $R$.

f. Hyperplanes Connectivity and Convexity are cri- 
teria that require the region to be "reasonably-shaped," 
while Peg is designed to ensure that the region is "in a be- 
lievable location." It is possible, and may at times be de- 
sirable, to impose a criterion that judges both shape and 
location simultaneously. A region $R$ in a $d$-dimensional
unit box is said to satisfy Hyperplanes if, for each data
point $p$ inside $R$, one can draw a $(d-1)$-dimensional
hyperplane through $p$ such that all data points on the
side of the hyperplane containing the point $\vec{1}$ (the "upper
right-hand corner of the unit box") are inside $R$.

More complicated region criteria may be built from com- 
binations and variations of these and other basic ele- 
ments. 



APPENDIX D: SEARCH HEURISTIC DETAILS 



The heuristic Sleuth uses to search for the region of greatest excess may
usefully be visualized as a set of rules for an amoeba to move within the
unit box. We monitor the amoeba's progress by maintaining a list of the
most interesting region of size $N$ (one for each $N$) that the amoeba
has visited so far. At each state, the amoeba is the region under
consideration, and the rules tell us what region to consider next.

The initial location and size of the amoeba is determined by the
following rules for seeding:

1. If we have not yet searched this data set at all, the starting amoeba
fills the entire box.

2. Otherwise, the amoeba starts out as the region around a single random
point that has not yet inhabited a "small" region that we have considered
so far. We consider a region $R$ to be small if adding or removing an
individual point can have a sizeable effect on the $p_N^R$; in practice,
a region is small if $N < 20$.

3. If there is no point that has not yet inhabited a small region that we
have considered so far, the search is complete.

At each stage, the amoeba either grows or shrinks. It begins by
attempting to grow. The rules for growth are:

1. Allow the amoeba to encompass a neighboring data point. Force it to
encompass any other data points necessary to make the expanded amoeba
satisfy all criteria. Check to see whether the $p_N^R$ of the expanded
amoeba is less than the $p_N^R$ of the region on the list of the same
size. If so, the amoeba has successfully grown, the list of the most
interesting regions is updated, and the amoeba tries to grow again. If
not, the amoeba shrinks back to its former size and repeats the same
process using a different neighboring data point.

2. If the amoeba has tried all neighboring data points and has not
successfully grown, it shrinks.

The rules for shrinking are:

1. Force the amoeba to relinquish the data point that owns the most
background, subject to the requirement that the resulting shrunken amoeba
be consistent with the criteria.

2. If the amoeba has shrunk out of existence or can shrink no further, we
kill this amoeba and reseed.

The result of this process is a list of regions of length
$N_{\text{data}}$ (one region for each $N$), such that the $N$th region
in the list is the most interesting region of size $N$ found in the data
set.
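
The seeding, growth, and shrinking rules can be sketched in simplified form. The code below is not Sleuth itself: it omits the region criteria entirely, takes the neighbor relation and the background "owned" by each data point as inputs, scores a region by the Poisson probability of seeing at least as many events as it contains, and marks a point as having inhabited a small region only when the amoeba actually occupies that region. All names are hypothetical.

```python
import math
import random


def poisson_tail(n, b):
    """P(X >= n) for X ~ Poisson(b)."""
    term, lower = math.exp(-b), 0.0
    for k in range(n):
        lower += term
        term *= b / (k + 1)
    return max(0.0, 1.0 - lower)


def amoeba_search(neighbors, bkg, small=20, seed=0):
    """Return {N: (p, region)} with the most interesting region of each size."""
    rng = random.Random(seed)
    n = len(bkg)
    best = {}               # region size -> (p, frozenset of point indices)
    inhabited = set()       # points that have been part of a small region

    def p_value(region):
        return poisson_tail(len(region), sum(bkg[i] for i in sorted(region)))

    def beats_list(region):
        size = len(region)
        return size not in best or p_value(region) < best[size][0]

    def visit(region):
        if len(region) < small:
            inhabited.update(region)
        if beats_list(region):
            best[len(region)] = (p_value(region), frozenset(region))

    def crawl(region):
        visit(region)
        while region:
            grew = True
            while grew:                          # growth rules
                grew = False
                frontier = {j for i in region for j in neighbors[i]} - region
                for j in sorted(frontier):
                    if beats_list(region | {j}):
                        region = region | {j}
                        visit(region)
                        grew = True
                        break
            # shrinking rule: give up the point owning the most background
            region = region - {max(sorted(region), key=lambda i: bkg[i])}
            if region:
                visit(region)

    crawl(set(range(n)))                         # seeding rule 1
    while True:
        seeds = [i for i in range(n) if i not in inhabited]
        if not seeds:                            # seeding rule 3
            return best
        crawl({rng.choice(seeds)})               # seeding rule 2


if __name__ == "__main__":
    chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
    background = [1.0, 1.0, 0.0625, 0.0625, 1.0, 1.0]
    result = amoeba_search(chain, background)
    p, region = min(result.values(), key=lambda entry: entry[0])
    print(sorted(region), round(p, 4))
```

In this toy example, six points on a line with two low-background points in the middle, the most interesting region found is the pair of low-background points. Each successful growth strictly improves an entry on the list and each shrink removes a point, so the search terminates.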
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